2. Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
3. Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
4. Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
the output is a list of all matches in T.
5. Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
the output is a list of all matches in T.
e.g. 4, 6, 10
4 6 10
6. Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
the output is a list of all matches in T.
• Last lecture we saw the that text indexing problem can be solved using a suffix tree
e.g. 4, 6, 10
4 6 10
which uses O(n) space (when it’s stored compacted)
7. Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
the output is a list of all matches in T.
• Last lecture we saw the that text indexing problem can be solved using a suffix tree
• Queries take O(m + occ) time when the alphabet size is constant
- occ is the number of occurences (matches)
e.g. 4, 6, 10
4 6 10
which uses O(n) space (when it’s stored compacted)
8. Text indexing
T
Preprocess a text string T (length n) to answer pattern matching queries. . .
ba b c a b a cb a a b a
n
After preprocessing, a query is a pattern P (length m),
P a b a
m
the output is a list of all matches in T.
• Last lecture we saw the that text indexing problem can be solved using a suffix tree
• Queries take O(m + occ) time when the alphabet size is constant
- occ is the number of occurences (matches)
• Suffix trees can be constructed in O(n) time (but we only saw how to achieve O(n2) time)
e.g. 4, 6, 10
4 6 10
which uses O(n) space (when it’s stored compacted)
10. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
11. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa snsuffix
1
12. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
13. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
14. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
15. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
16. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a
17. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a
18. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a
19. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c<
20. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c<
21. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c<
22. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
23. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
24. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
25. The suffix array
T b n aaT a sn
n 0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
Sort the suffixes
lexicographically
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
26. The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
27. The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
28. The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
29. The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
In lexicographical ordering we sort strings based on the first symbol that differs:
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
30. The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
just a fancy name for the order the strings would appear in the dictionary
In lexicographical ordering we sort strings based on the first symbol that differs:
31. The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
• The symbols themselves must have an order
throughout we will use alphabetical order
If the symbols don’t have a natural order, we use their binary representation in memory
b a<a a b c< b c< a
(in a ‘tie’, the shorter string is smaller)
just a fancy name for the order the strings would appear in the dictionary
In lexicographical ordering we sort strings based on the first symbol that differs:
32. The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
33. The suffix array
T b n aaT a sn
n
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
10 2 3 4 5 6
34. The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
10 2 3 4 5 6
35. The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
10 2 3 4 5 6
36. The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
10 2 3 4 5 6
37. The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
The suffix array is much smaller than the suffix tree (in terms of constants)
10 2 3 4 5 6
38. The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes
lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
The suffix array is much smaller than the suffix tree (in terms of constants)
10 2 3 4 5 6
39. From Suffix Trees to Suffix Arrays
a
s$
nas$
nas$
nas$
s$
na
s$
bananas$
1 3
5
0
2 4
6
T b n aaT a sn
n
Suffix Array
Suffix Tree
1 0 625 43
n
Recall that we can get get the Suffix Array from the Suffix Tree
1 3
5
0
2 4
6
10 2 3 4 5 6
using depth-first search in O(n) time
40. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
157 100 1 43 112 1396 1485 12
c db cb a c bc c d adb
41. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
c db cb a c bc c d adb
m
42. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
Find an occurence of P
using binary search
43. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences could start anywhere
Find an occurence of P
using binary search
44. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences could start anywhere
Find an occurence of P
using binary search
45. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences could start anywhere
Find an occurence of P
using binary search
46. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c d adb>
c db cb a c bc c d adb
m
c db cb
occurences could start anywhere
Find an occurence of P
using binary search
47. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c d adb>
c db cb a c bc c d adb
m
c db cb
occurences could start anywhere
Find an occurence of P
using binary search
48. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c d adb>
c db cb a c bc c d adb
m
c db cb
occurences could start anywhere
Find an occurence of P
using binary search
49. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c d adb>
c db cb a c bc c d adb
m
c db cb
occurences must start in here
Find an occurence of P
using binary search
50. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
51. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
52. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
53. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb < c bc db
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
54. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb < c bc db
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
55. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb < c bc db
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
56. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
57. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
occurences must start in here
Find an occurence of P
using binary search
58. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
=
c db cb a c bc c d adb
c db cb
m
c db cb
occurences must start in here
Find an occurence of P
using binary search
59. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
=
c db cb a c bc c d adb
c db cb
m
c db cb
occurences must start in here
we found a match!
Find an occurence of P
using binary search
60. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
m
Find an occurence of P
using binary search
61. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
How long does this take?
m
Find an occurence of P
using binary search
62. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
How long does this take?
O(m) time to compare two strings
m
Find an occurence of P
using binary search
63. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
How long does this take?
O(m) time to compare two strings
so O(m log n) time in total
m
Find an occurence of P
using binary search
64. Searching in the Suffix Array
0
1
2
3
5
6
7
8
9
10
11
12
13
14
15
4
b c dba cb a c dbc cb d a
c dba cb a c bc c d adb
db cb a c bc c d adb
d cb a c bc c d adb
d c a c bc c d adb
c a c bc c d adb
a c bc c d adb
c bc c d adb
c b c d adb
c d a
d a
a
c d ad
c d adb
b c d adb
2
5
6
8
9
12
13
14
d c a c bc c d adb
c a c bc c d adb
c b c d adb
c d a
d a
c d ad
b c dbT a cb a c dbc cb d a
n
15 7 100Suffix Array 1 43 11 2 1396 148 5 12
c db cbP
157 100 1 43 112 1396 1485 12
find
Key Idea:
c db cb a c bc c d adb
How long does this take?
O(m) time to compare two strings
so O(m log n) time in total
This method generalises to O(m log n + occ) time
m
to find all occ occurences.
Find an occurence of P
using binary search
by continuing the binary search
(we will skip the details)
65. The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
P aa n
m
an
Finding an occurrence of a pattern (length m) takes O(m log n) time
Finding all occurrences takes O(m log n + occ) time
where occ is the number of occurences
66. The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
P aa n
m
an
Finding an occurrence of a pattern (length m) takes O(m log n) time
Finding all occurrences takes O(m log n + occ) time
where occ is the number of occurences
This can be further improved to O(m + log n + occ) time
(using LCP queries which we will see in a future lecture)
67. The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
P aa n
m
an
Finding an occurrence of a pattern (length m) takes O(m log n) time
Finding all occurrences takes O(m log n + occ) time
where occ is the number of occurences
Do we really need to build the suffix tree to construct the suffix array?
This can be further improved to O(m + log n + occ) time
(using LCP queries which we will see in a future lecture)
68. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
69. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
70. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
71. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
72. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
73. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
74. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
R1 is split into blocks of length 3
75. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
76. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
77. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
R2 is also split into blocks of length 3
78. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
79. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
80. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
81. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1
82. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2
83. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 3
84. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34
85. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 5
86. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56
87. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
88. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
we assume that the bit representation of each symbol uses O(log n) bits.
(which is a common and realistic assumption)
This can be done by sorting the blocks in O(n) time using radix sort
89. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
90. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
91. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
92. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
93. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
1 2 3 4 5 6 70
94. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
1 2 3 4 5 6 70
0 1 6 4 2 5 3 7
95. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
1 2 3 4 5 6 70
0 1 6 4 2 5 3 7
How do we compute the suffix array for R ?
Recursion!
(Notice that R has length 2n/3)
96. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
1 2 3 4 5 6 70
0 1 6 4 2 5 3 7
97. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
what use1 2 3 4 5 6 70
0 1 6 4 2 5 3 7
is this?!
98. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $R1 =
b d ob a a b b a d $R2 = $
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
Introduce a new
“filler symbol” $.
B2 contains indices with
i mod 3 = 2
Number the blocks in lexicographical order ($ is the smallest symbol)
1 2 4 34 56 7
let R = 1 2 4 34 56 7
compute the suffix array of R :
what use1 2 3 4 5 6 70
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
is this?!
99. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
is this?!
100. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2
is this?!
101. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2
a b d ob a a b b a d
d oa b b a d
is this?!
1
5
102. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
is this?!
1
5
103. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
is this?!
1
5
104. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
is this?!
1
5
105. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
is this?!
1
5
106. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
d oa b b a d $ $
is this?!
1
5
107. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
d oa b b a d $ $
their order is given by the suffix array of R :
is this?!
1
5
108. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
d oa b b a d $ $
their order is given by the suffix array of R :
is this?!
1
5
109. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
a b d ob a a b b a d
d oa b b a d
a b d ob a a b b a d $ b d ob a a b b a d $ $
d oa b b a d $ $
their order is given by the suffix array of R :
is this?!
1
5
Suffix is smaller
than suffix
1
55
110. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
111. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
112. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
ob b a d $ b d ob a a b b a d $ $
113. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
d oa b b a d $ $
ob b a d $ b d ob a a b b a d $ $
114. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
d oa b b a d $ $
ob b a d $ b d ob a a b b a d $ $
115. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
d oa b b a d $ $
ob b a d $ b d ob a a b b a d $ $
116. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
ob b a d7
d oa b b a d $ $
ob b a d $ b d ob a a b b a d $ $
Suffix is smaller
than suffix
7
5
117. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
118. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
b d ob a a b b a d2
119. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
b d ob a a b b a d2
d oa b b a d $ $
b d ob a a b b a d $ $
120. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
b d ob a a b b a d2
d oa b b a d $ $
b d ob a a b b a d $ $
121. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
b d ob a a b b a d2
d oa b b a d $ $
b d ob a a b b a d $ $
122. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
Take any two suffixes in B1 ∪ B2 and find them in R
their order is given by the suffix array of R :
is this?!
d oa b b a d5
b d ob a a b b a d2
d oa b b a d $ $
b d ob a a b b a d $ $
Suffix is smaller
than suffix
2
5
123. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
compute the suffix array of R :
what use
0 1 6 4 2 5 3 7
0 1 2 3 4 5 6 7
is this?!
124. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
125. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
126. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
after relabelling,
127. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
after relabelling, 1 24 111058 7
128. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
after relabelling,
we have the suffix array of just the suffixes from B1 ∪ B2
1 24 111058 7
129. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
after relabelling,
we have the suffix array of just the suffixes from B1 ∪ B2
1 24 111058 7
130. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
a b d ob a a b b a d $ b d ob a a b b a d $ $
Concatenate R1 and R2 to obtain R:
B2 contains indices with
i mod 3 = 2
0 1 2 3 4 5 6 7
The suffix array of R : 0 1 6 4 2 5 3 7
after relabelling,
we have the suffix array of just the suffixes from B1 ∪ B2
1 24 111058 7
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
131. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
132. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2 1 24 111058 7
133. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
134. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
1 24 111058 7
135. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
1 24 111058 7
136. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
suffix 0 is a y
1 24 111058 7
followed by suffix 1
137. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
1 24 111058 7
138. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
1 24 111058 7
139. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
140. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
141. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
142. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
143. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
How do we find the ordering of the suffixes from B0? (where i mod 3 = 0)
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
144. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
145. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
Suffix array for just B0 6 09 3
146. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
Each suffix i ∈ B0 is represented by (T[i], r) where r is the rank of suffix (i + 1)
1 24 111058 7
(the ranks are given by the array below)
rank:
We then sort in O(n) time using radix sort
y a b d ob a a b b a d
d ob a a b b a d
oa b b a d
oa d
0
3
6
9
= + 1y
= + 4b
= + 7a
= +a 10
(y, 0)
(b, 1)
(a, 4)
(a, 6)
Suffix array for just B0 6 09 3 How do we merge these?
147. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
148. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
149. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
which is smaller, suffix or ?1 6
150. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
which is smaller, suffix or ?
+ 7a=
1 6
6
1 +a= 2
151. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
which is smaller, suffix or ?
+ 7a=
1 6
6
1 +a= 2
(a, 4)
(a, 3)
152. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
which is smaller, suffix or ?
+ 7a=
1 6
6
1 +a= 2
(a, 4)
(a, 3)
It takes O(1) time to decide
that 1 is smaller
153. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . .
which is smaller, suffix or ?
+ 7a=
1 6
6
1 +a= 2
(a, 4)
(a, 3)
It takes O(1) time to decide
that 1 is smaller
1
154. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1
which is smaller, suffix or4 6 ?
155. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1
which is smaller, suffix or4 6
+ 7a=6
4 +a= 5
?
156. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1
which is smaller, suffix or4 6
+ 7a=6
4 +a= 5
(a, 4)
(a, 5)
?
157. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1
which is smaller, suffix or4 6
+ 7a=6
4 +a= 5
(a, 4)
(a, 5)
Again, it takes O(1) time to decide
that is smaller6
?
158. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6
159. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6
which is smaller, suffix or4 9 ?
160. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6
)4(which is smaller, suffix or4 9 ? is smaller
161. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
)4(which is smaller, suffix or4 9 ? is smaller
162. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
8which is smaller, suffix or 9 ?
163. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
8
+a=
+b=
9 10
8 9
which is smaller, suffix or 9 ?
164. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
8
+a=
+b=
9 10
8 9
Uh oh! how do we compare 9 10to ?
which is smaller, suffix or 9 ?
165. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
8
Uh oh! how do we compare 9 10to ?
+a=
+b=
9
8
d
a
+
+
11
10
It still takes O(1) time to decide
that is smaller9
which is smaller, suffix or 9 ?
166. The DC3 method
y a b d ob a a b b a d
1 2 3 4 5 6 7 8 9 10 110
T =
B1 contains indices with
i mod 3 = 1
B2 contains indices with
i mod 3 = 2
Suffix array for just B1 ∪ B2
1 2 3 4 5 6 70
1 24 111058 7
Suffix array for just B0 6 09 3 How do we merge these?
Merge them like in mergesort. . . 1 6 4
Overall this merging phase takes O(n) time
(because processing each suffix takes O(1) time)
168. The DC3 method
Theorem
The DC3 algorithm constructs a suffix array in O(n) time.
Proof
Suppose T(n) is the running time. We have
T(n) = T(2n/3) + O(n)
169. The DC3 method
Theorem
The DC3 algorithm constructs a suffix array in O(n) time.
Proof
Suppose T(n) is the running time. We have
T(n) = T(2n/3) + O(n)
radix sorting and merging
recursion to construct
a suffix array of size 2n/3
170. The DC3 method
Theorem
The DC3 algorithm constructs a suffix array in O(n) time.
Proof
Suppose T(n) is the running time. We have
T(n) = T(2n/3) + O(n)
Solving this recurrence gives T(n) ∈ O(n).
radix sorting and merging
recursion to construct
a suffix array of size 2n/3
171. The suffix array
T b n aaT a sn
n
Suffix Array 1 0 625 4
Sort the suffixes lexicographically
0 b n aaa sn
n aa1 a sn
2 n aa sn
4 a sn
5 a s
6 s
3 aa sn
3
n
P aa n
m
an
Finding an occurrence of a pattern (length m) takes O(m log n) time
Finding all occurrences takes O(m log n + occ) time
where occ is the number of occurences
We can construct the suffix array in O(n) time
This can be further improved to O(m + log n + occ) time
(using LCP queries which we will see in a future lecture)