SlideShare ist ein Scribd-Unternehmen logo
1 von 36
Downloaden Sie, um offline zu lesen
.
                        Algorithm of Suffix Tree
                           by farseerfc@gmail.com
.

                                   Jiachen Yang1
                           1 Research   Student in Osaka University


                               November 12, 2011




                                                                .     .   .        .      .       .

    jc-yang (Osaka-U)                       SuffixTree                         November 12, 2011       1 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .        .      .       .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011       2 / 25
What can we do with suffix tree?
.
                                           Find properties of the strings
Linear algorithms for exact string           . Find the longest common substrings of the
                                             1
matching
                                               string Si and Sj in θ (ni + nj ) time .
    like KMP                                 . Find all maximal pairs, maximal repeats or
                                             2

Search for strings                             supermaximal repeats in θ (n + z) time, if there
  . Check if a string P of length m is
  1                                            are z such repeats .
                                             . Find the Lempel-Ziv decomposition in θ (n)
                                             3
    a substring in O(m) time.
  . Find all z occurrences of the
  2                                            time .
                                             . Find the longest repeated substrings in θ (n)
    patterns P1 , · · · , Pq of total        4


    length m as substrings in                  time.
                                             . Find the most frequently occurring substrings
                                             5
    O(m + z) time.
  . Search for a regular expression P
  3                                            of a minimum length in θ (n) time.
                                             . Find the shortest strings from ∑ that do not
                                             6
    in time expected sublinear in n .
  . Find for each suffix of a pattern
  4                                            occur in D, in O(n + z) time, if there are z such
    P, the length of the longest               strings.
                                             . Find the shortest substrings occurring only
                                             7
    match between a prefix of
    P[i . . . m] and a substring in D in       once in θ (n) time.
                                             . Find, for each i, the shortest substrings of Si
    θ (m) time .                             8

  . ...
  5                                            not occurring elsewhere in D in θ (n) time.
                                             . ...
                                             9


                                                               .     .     .        .      .       .

      jc-yang (Osaka-U)                     SuffixTree                          November 12, 2011       3 / 25
Trie, Radix Tree, Suffix Trie & Suffix Tree

         trie1                    is a dictionary tree.
                   (AKA prefix tree)

                             stores a set of words.
                             each node represents a character except that root is empty string.
                             words with common prefix share same parent nodes.
                             minimal deterministic finite automaton that accepts all words.
  radix tree                            is a trie with compressed chain of nodes.
                   (AKA patricia trie or radix trie)

                             Each internal node has at least 2 children.
  suffix trie is a trie which stores all suffix of a given string.
  suffix tree is a suffix radix tree.
                   that enables linear time construction and fast
                   algorithms of other problems on a string.
   1 pronounced as in word retrieval by its inventor, /tri:/ “tree”, but
pronounced /traI/ “try” by other authors                           .   .    .        .      .       .

      jc-yang (Osaka-U)                                SuffixTree                November 12, 2011       4 / 25
Trie & Radix Tree

.
Trie of “A to tea ted ten i
in inn”
.
                                 .
                                 Radix tree example
                                 .




                                 .

.
                                          .   .   .        .      .       .

    jc-yang (Osaka-U)         SuffixTree               November 12, 2011       5 / 25
Suffix Trie & Suffix Tree of “banana”
                   ^




     b        a          n       $
                                                                                0:6


 a       n         $         a                                  banana$ a             na         $


                                                      1:0             8:6                 6:6            10:6
 n       a                   n       $

                                                                na     $                   na$       $
 a       n         $         a
                                                      4:6             9:6                 3:2            7:6

 n       a                   $
                                                      na$   $


 a       $
                                                2:1         5:6



 $


                                                                  .         .         .          .         .    .

     jc-yang (Osaka-U)                   SuffixTree                                         November 12, 2011        6 / 25
Suffix Tree of “mississippi”
                                                    mississippi$

                                                            0:11


                            (1,2)                     (0,...) (8,9)            (11,...)      (2,3)
                              i                     mississi... p                $...          s


                  12:8                        1:0          15:10                  18:11              4:4


            (8,...) (2,5)          (11,...)                   (10,...)         (9,...)
            ppi$... ssi              $...                       i$...          pi$...

                                                                                             (3,5)           (4,5)
  13:8           6:8                 17:11                 16:10                  14:8
                                                                                               si              i

                (8,...)       (5,...)
                ppi$...     ssippi$...


          7:8                2:1                                         8:8                                 10:8


                                                               (8,...) (5,...)                           (8,...)   (5,...)
                                                               ppi$... ssippi$...                        ppi$... ssippi$...


                                                     9:8                 3:2              11:8                  5:4



                                                                                                     .           .            .        .      .       .

     jc-yang (Osaka-U)                                             SuffixTree                                                      November 12, 2011       7 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .        .      .       .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011       8 / 25
History of Suffix Tree Algorithms
.
                                                           M. Farach. Optimal suffix tree
    First linear algorithm was introduced by Weiner            construction with large
                                                               alphabets. In focs, page
    1973 as position tree. Awarded by Donald Knuth             137. Published by the IEEE
                                                               Computer Society, 1997.
    as “Algorithm of the year 1973”.                       E. M. McCreight. A
                                                               space-economical suffix
    Greatly simplified by McCreight 1976.                       tree construction algorithm.
                                                               J. ACM, 23:262–272, April
Above two algorithms are processing string backward.           1976. ISSN 0004-5411.
                                                               doi: http://doi.acm.org/10.
                                                               1145/321941.321946. URL
    First online construction by Ukkonen 1995, which           http:
                                                               //doi.acm.org/10.
    is easier to understand.                                   1145/321941.321946.
                                                           E. Ukkonen. On-line
                                                               construction of suffix trees.
Above algorithms assume size of alphabet as fixed               Algorithmica, 14:249–260,
                                                               1995. ISSN 0178-4617.
constant .                                                     URL
                                                               http://dx.doi.org/10.
    Limitation was break by Farach 1997, optimal for           1007/BF01206331.
                                                               10.1007/BF01206331.
    all alphabets.                                         P. Weiner. Linear pattern
                                                               matching algorithms. In
                                                               Switching and Automata
Further study are continued to scale to scenarios when         Theory, 1973. SWAT ’08.
                                                               IEEE Conference Record of
the whole suffix tree or even input string cannot fit into       14th Annual Symposium on,
                                                               pages 1 –11, oct. 1973. doi:
memory.                                           .   .    .
                                                               10.1109/SWAT.1973.13.
                                                                    .        .        .

     jc-yang (Osaka-U)             SuffixTree                   November 12, 2011          9 / 25
Backward Construction of Suffix Tree
   1                       2                        3                                4


  banana$            banana$ anana$      banana$ anana$       nana$        banana$ ana nana$



                                                                                    na$ $




                 7                              6                               5


         banana$ a na          $      banana$       a    na           banana$ ana                na



              na $         na$ $         na $           na$ $             na$ $                na$ $


        na$ $                         na$ $


                                                                      .     .            .         .       .     .

       jc-yang (Osaka-U)                        SuffixTree                                    November 12, 2011   10 / 25
Construct Suffix Tree by Sorting Suffix
Suffix :                  Sorted suffix :          Tree of sorted suffix :
mississippi              i                       | -i - >| -
ississippi               ippi                    |       | - ppi
ssissippi                issippi                 |       | - ssi - >| - ppi
sissippi                 ississippi              |                   | - ssippi
issippi                  mississippi             | - mississippi
ssippi                   pi                      | -p - >| - i
sippi                    ppi                     |       | - pi
ippi                     sippi                   | -s - >| -i - - >| - ppi
ppi                      sissippi                        |         | - ssippi
pi                       ssippi                          | - si - >| - ppi
i                        ssissippi                                 | - ssippi
Time complexity will be O(N 2 log N) .
Space complexity will be O(N 2 ) .
                                                        .   .    .         .       .     .

     jc-yang (Osaka-U)               SuffixTree                       November 12, 2011   11 / 25
Naïve Algorithm

S UFFIX T REE(string)
1 for i ← 1 to length(string)
2       do U PDATE(treei )    £ Phrase i

U PDATE(treei )
1 for j ← 1 to i
2       do node ← treei .F IND(suffix[j to i − 1])
3          E XTEND(node, string[i])       £ Extension j

Time complexity will be O(N 3 ) .
Space complexity will be O(N 2 ) .
The challenge is to make sure treei is updated to treei+1 efficiently.
                                                .   .    .         .       .     .

     jc-yang (Osaka-U)           SuffixTree                   November 12, 2011   12 / 25
Suffix Extend Cases of Naïve Algorithm
.
     Case I If path Sj...i ends at leaf, append a char Si+1 to end of edge into leaf.
         Sj...i = . . . na                                       Sj...i+1 = . . . nan

                    na                                                  nan

    Case II If path Sj...i ends in the middle of an edge , and next char Si+1 is not equal to
              the next char in the edge, split that edge, create a internal node, add a new
              edge to a new leaf.
         Sj...i = . . . na                                        Sj...i+1 = . . . nay
                                                                                         n
                    nan                                                 na
                                                                                         y
    Case III If path Sj...i ends in the middle of an edge , and next char Si+1 is equal to the
               next char in the edge, do nothing, extenstion has done.
          Sj...i = . . . na                                      Sj...i+1 = . . . nan
                    nan                                                 nan

                                                              .     .     .         .        .    .

       jc-yang (Osaka-U)                   SuffixTree                          November 12, 2011   13 / 25
Naïve Online Construction of Suffix Tree
   1                   2            3                                  4


       b           ba      a     ban an n                       bana ana na



               7                     6                             5


       banana$ a na        $   banana anana     nana           banan anan nan



           na $        na$ $



       na$ $


                                                       .   .       .         .       .     .

   jc-yang (Osaka-U)                SuffixTree                          November 12, 2011   14 / 25
Properties of Suffix Tree
.
    . Each update will add exactly 1
    1

      leaf node .
                                                                                          0:6
              nr_leaf = N
    . Suffix tree is full tree.
    2
                                                                           banana$ a              na          $
              Each internal node has at least 2
              children.                                      1:0                    8:6             6:6               10:6
              nr_internal < N
              nr_node < 2N                                                 na        $                  na$       $
    . Worst case Fabonacci word
    3
                                                             4:6                    9:6             3:2               7:6
              abaababaabaab
    . Suffix is either explicit or implicit.
    4
                                                             na$       $

              Explicit when it ends at a node.         2:1             5:6
              Implicit when it ends in the
              middle of an edge.

                                                                   .            .         .         .         .        .

        jc-yang (Osaka-U)                  SuffixTree                                          November 12, 2011         15 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .         .       .     .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011   16 / 25
Optimization of Naïve Algorithm
.
     . Substrings can be represented as (start, end) pair of their index in
     1

       orignal string.
                    Reduce space complexity to O(N) if size of alphabet is fixed constant.
     . Once a leaf, Always a leaf
     2

                    Represent edge that links to a leaf as (start, · · · ).
                    Extend leaf nodes for free. We do not need Extend Case I.
                               0:6
                                                                                                                   0:6


                    banana$ a        na         $                                                    (1,2)        (0,...)  (6,...)           (2,4)
                                                                                                       a        banana$... $...               na

          1:0            8:6          6:6               10:6                      8:6                 1:0                   10:6                    6:6


                                                                                 (6,...)   (2,4)                                                   (6,...)   (4,...)
                    na    $               na$       $                             $...      na                                                      $...     na$...


          4:6            9:6          3:2               7:6                9:6                 4:6                                     7:6                    3:2


                                                                                           (6,...)    (4,...)
          na$   $                                                                           $...      na$...


                                                                                   5:6                  2:1
    2:1         5:6                                                                        .            .            .             .           .              .

           jc-yang (Osaka-U)                                   SuffixTree                                                 November 12, 2011                    17 / 25
Active Point

                                     During a phrase, if we meet Extend Case III, that is if we found
                                     S[i + 1] already exists in suffix[j . . . i] then S[i + 1] will exists in
                                     ∀suffix[k . . . i], k ∈ j . . . i.
                                     Thus Case III is a sign that means update of this phrase is finished.
                                     During phrase i if we stopped at suffix[k . . . i] by Case III, then in
                                     next phrase we can start from suffix[k . . . i + 1] because all suffix
                                     start with 1 . . . k − 1 will end at Case I.
                                     We called this point(current internal node, current position k in
                                     string) as Active Point.
a                    ab                       aba                      abaa                   abaab                    abaaba                                  abaabab                                              abaababa                                              abaababaa                                           abaababaab                                          abaababaaba                                                         abaababaabaa                                                                       abaababaaba ab


0:0               0:1                       0:2                      0:3                      0:4                      0:5                                        0:6                                                   0:7                                                 0:8                                                0:9                                                 0:10                                                                             0:11                                                                          0:12


 (0,...)         (0,...)   (1,...)         (0,...)   (1,...)         (0,1)    (1,...)     (0,1)        (1,...)     (0,1)         (1,...)                       (0,1)      (1,3)                                     (0,1)      (1,3)                                       (0,1)    (1,3)                                     (0,1)     (1,3)                                     (0,1)     (1,3)                                                                   (0,1)       (1,3)                                                            (0,1)         (1,3)
  a...            ab...     b...           aba...     ba...            a      baa...        a         baab...        a          baaba...                         a         ba                                         a         ba                                           a       ba                                         a        ba                                         a        ba                                                                       a          ba                                                                a            ba


1:0        1:0              2:1      1:0              2:1      3:3              2:1     3:3             2:1      3:3             2:1                    3:3               7:6                                 3:3              7:6                                  3:3             7:6                                 3:3             7:6                                 3:3             7:6                                                              3:3                  7:6                                                   3:3                          7:6


                                                                (3,...)       (1,...)    (3,...)       (1,...)    (3,...)     (1,...)             (3,...) (1,3)              (3,...)   (6,...)          (3,...) (1,3)             (3,...)   (6,...)           (3,...) (1,3)             (3,...)   (6,...)       (3,...)  (1,3)            (3,...)     (6,...)       (3,...)  (1,3)             (3,...)     (6,...)                                    (3,6) (1,3)                  (3,6)       (6,...)                             (3,6) (1,3)                        (3,6)         (6,...)
                                                                 a...         baa...      ab...       baab...     aba...     baaba...            abab... ba                 abab...     b...           ababa... ba               ababa...    ba...          ababaa... ba              ababaa...   baa...      ababaab... ba             ababaab...   baab...     ababaaba... ba             ababaaba...   baaba...                                     aba   ba                     aba       baabaa...                             aba ba                             aba        baabaab...


                                                               4:3              1:0     4:3             1:0      4:3             1:0       4:3          5:6               2:1          8:6       4:3          5:6              2:1          8:6       4:3           5:6              2:1          8:6       4:3          5:6                2:1       8:6       4:3         5:6                 2:1       8:6                        13:11                   5:6                11:11           8:6                   13:11             5:6                         11:11           8:6


                                                                                                                                                     (3,...)    (6,...)                                    (3,...) (6,...)                                       (3,...) (6,...)                                     (3,...)   (6,...)                                    (3,...)   (6,...)                                      (11,...) (6,...)         (3,6)      (6,...)       (11,...)     (6,...)            (11,...) (6,...)  (3,6)               (6,...)       (11,...)      (6,...)
                                                                                                                                                    abab...      b...                                     ababa... ba...                                       ababaa... baa...                                    ababaab... baab...                                  ababaaba... baaba...                                        a... baabaa...          aba      baabaa...        a...      baabaa...            ab... baabaab... aba               baabaab...       ab...      baabaab...


                                                                                                                                                 1:0              6:6                                  1:0              6:6                                   1:0             6:6                                 1:0                 6:6                             1:0                 6:6                            14:11        4:3            9:11            6:6           12:11           2:1     14:11         4:3           9:11              6:6           12:11           2:1


                                                                                                                                                                                                                                                                                                                                                                                                                                                (11,...)      (6,...)                                                              (11,...)     (6,...)
                                                                                                                                                                                                                                                                                                                                                                                                                                                  a...       baabaa...                                                              ab...     baabaab...


                                                                                                                                                                                                                                                                                                                                                                                                                                             10:11            1:0                                                              10:11             1:0




                                                                                                                                                                                                                                                                                                                                                                            .                                        .                                .                                        .                               .                                            .

                                     jc-yang (Osaka-U)                                                                                                                                                                                                       SuffixTree                                                                                                                                                                                      November 12, 2011                                                                                                 18 / 25
Ukk’s Update using Active Point

U PDATE(treei )
 1 current_suffix ← active_point
 2 next_char ← string[i]
 3 while True
 4      do if there exists edge start with next_char
 5              then break       £ Case III
 6              else
 7                   split current edge if implicit
 8                   create new leaf with new edge labelled next_char
 9          if current_suffix is empty
10              then break
11              else current_suffix ← next shorter suffix
12 active_point ← current_suffix
                                              .   .   .         .       .     .

     jc-yang (Osaka-U)          SuffixTree                 November 12, 2011   19 / 25
Suffix Link to find next shorter suffix
            Suffix link
                             Internal node of suffix X α has a link to node α .
                             If α is empty, suffix link points to root.
            How to create suffix link
                             Link together every internal nodes that are created by splitting in
                             same phrase.
                            banana$                                                     banana$

                             0:6                                                        0:6


            (1,2) (0,...)  (6,...)               (2,4)                  (0,...)  (1,2)             (6,...)     (2,4)
              a banana$... $...                   na                  banana$...   a                $...        na


 8:6                 1:0               10:6         6:6         1:0              8:6               10:6            6:6


  (6,...)         (2,4)                       (6,...) (4,...)            (6,...) (2,4)                       (6,...) (4,...)
   $...            na                          $... na$...                $... na                             $... na$...


 9:6                 4:6               7:6          3:2         9:6              4:6               7:6             3:2


                  (6,...)    (4,...)                                          (6,...)    (4,...)
                   $...      na$...                                            $...      na$...


            5:6               2:1                                       5:6               2:1
                                                                                                                               .   .   .         .       .     .

             jc-yang (Osaka-U)                                                                SuffixTree                                    November 12, 2011   20 / 25
Fast jump using Suffix Link


   Assume we are at Suffix X αβ , whose parent internal node
   represent X α .
      1. Go back to parent internal node,
      2. Jumping follow the node’s suffix link to the node represent Suffix
         α
      3. Go down to Suffix αβ .

   Even jump down in step 3 because we already know length of β .
   (Skip/Count trick)
   Combining all these tricks we can Extend a character in O(1)
   time.


                                                .    .   .         .       .     .

   jc-yang (Osaka-U)             SuffixTree                   November 12, 2011   21 / 25
Part Outline


.
1   What is Suffix Tree

.
2   History & Naïve Algorithm

.
3   Optimization of Naïve Algorithm

.
4   Examples & Analysis




                                            .   .   .         .       .     .

      jc-yang (Osaka-U)         SuffixTree               November 12, 2011   22 / 25
Experiment – mississippi
.

                        m

                        0:0


                         (0,...)
                          m...


                        1:0




                                               .   .   .         .       .     .

    jc-yang (Osaka-U)              SuffixTree               November 12, 2011   23 / 25
Experiment – mississippi
.
                             mi

                         0:1


                        (1,...)   (0,...)
                          i...     mi...


               2:1                    1:0




                                               .   .   .         .       .     .

    jc-yang (Osaka-U)              SuffixTree               November 12, 2011   23 / 25
Experiment – mississippi
.
                             mis

                              0:2


                        (1,...) (2,...)        (0,...)
                         is... s...            mis...


        2:1                   3:2                     1:0




                                                            .   .   .         .       .     .

    jc-yang (Osaka-U)                     SuffixTree                     November 12, 2011   23 / 25
Experiment – mississippi
.
                            miss

                              0:3


                        (1,...) (2,...)        (0,...)
                        iss... ss...           miss...


        2:1                   3:2                     1:0




                                                            .   .   .         .       .     .

    jc-yang (Osaka-U)                     SuffixTree                     November 12, 2011   23 / 25
Experiment – mississippi
.
                             miss i

                              0:4


                        (1,...) (0,...)        (2,3)
                        issi... missi...         s


         2:1                  1:0                      4:4


                                               (4,...) (3,...)
                                                 i...   si...


                             5:4                       3:2




                                                                 .   .   .         .       .     .

    jc-yang (Osaka-U)                      SuffixTree                         November 12, 2011   23 / 25
Experiment – mississippi
.
                              miss is

                              0:5


                         (1,...) (0,...)          (2,3)
                        issis... missis...          s


         2:1                   1:0                        4:4


                                                 (4,...) (3,...)
                                                  is... sis...


                              5:4                         3:2




                                                                   .   .   .         .       .     .

    jc-yang (Osaka-U)                        SuffixTree                         November 12, 2011   23 / 25
Experiment – mississippi
.
                               miss iss

                             0:6


                         (1,...)   (0,...)          (2,3)
                        ississ... mississ...          s


         2:1                       1:0                      4:4


                                                  (4,...) (3,...)
                                                  iss... siss...


                                5:4                        3:2




                                                                    .   .   .         .       .     .

    jc-yang (Osaka-U)                          SuffixTree                        November 12, 2011   23 / 25
Experiment – mississippi
.
                                mississi

                              0:7


                          (1,...)    (0,...)           (2,3)
                        ississi... mississi...           s


         2:1                        1:0                       4:4


                                                    (4,...) (3,...)
                                                    issi... sissi...


                                 5:4                         3:2




                                                                       .   .   .         .       .     .

    jc-yang (Osaka-U)                            SuffixTree                         November 12, 2011   23 / 25
Experiment – mississippi
.
                                               mississip

                                       0:8


                        (8,...) (1,2)                   (0,...)             (2,3)
                         p...     i                   mississi...             s


    14:8                     12:8                           1:0                      4:4


                   (8,...)     (2,5)
                    p...        ssi

                                                                            (3,5)             (4,5)
    13:8                     6:8
                                                                              si                i

                   (8,...)     (5,...)
                    p...       ssip...


     7:8                     2:1                            8:8                               10:8


                                          (8,...)     (5,...)                       (8,...)      (5,...)
                                           p...       ssip...                        p...        ssip...


           9:8                          3:2                         11:8                       5:4




                                                                                                           .   .   .         .       .     .

    jc-yang (Osaka-U)                                                  SuffixTree                                       November 12, 2011   23 / 25
Experiment – mississippi
.
                                              mississip p

                                       0:9


                        (8,...) (1,2)                   (0,...)             (2,3)
                        pp...     i                   mississi...             s


    14:8                     12:8                           1:0                      4:4


                   (8,...)     (2,5)
                   pp...        ssi

                                                                            (3,5)             (4,5)
    13:8                     6:8
                                                                              si                i

                   (8,...)      (5,...)
                   pp...       ssipp...


     7:8                     2:1                            8:8                               10:8


                                          (8,...)      (5,...)                      (8,...)       (5,...)
                                          pp...       ssipp...                      pp...        ssipp...


           9:8                          3:2                         11:8                       5:4




                                                                                                            .   .   .         .       .     .

    jc-yang (Osaka-U)                                                  SuffixTree                                        November 12, 2011   23 / 25
Experiment – mississippi
.
                                                       mississippi

                                                        0:10


                                              (1,2)       (0,...)         (8,9)             (2,3)
                                                i       mississi...         p                 s


                             12:8              1:0                       15:10                    4:4


               (2,5)                (8,...)                           (9,...) (10,...)
                ssi                 ppi...                             pi...    i...

                                                                                         (3,5)          (4,5)
    6:8                            13:8               14:8                   16:10
                                                                                           si             i

     (8,...)            (5,...)
     ppi...            ssippi...


    7:8                     2:1                                        8:8                                 10:8


                                                                (8,...) (5,...)                         (8,...)    (5,...)
                                                                ppi... ssippi...                        ppi...    ssippi...


                                              9:8                      3:2                 11:8                   5:4




                                                                                                                        .     .   .         .       .     .

      jc-yang (Osaka-U)                                                              SuffixTree                                        November 12, 2011   23 / 25
Experiment – mississippi
.
                                                        mississippi$

                                                                0:11


                                (1,2)                     (0,...) (8,9)            (11,...)      (2,3)
                                  i                     mississi... p                $...          s


                      12:8                        1:0          15:10                  18:11              4:4


                (8,...) (2,5)          (11,...)                   (10,...)         (9,...)
                ppi$... ssi              $...                       i$...          pi$...

                                                                                                 (3,5)         (4,5)
    13:8             6:8                 17:11                 16:10                  14:8
                                                                                                   si            i

                    (8,...)       (5,...)
                    ppi$...     ssippi$...


              7:8                2:1                                         8:8                               10:8


                                                                   (8,...) (5,...)                       (8,...)   (5,...)
                                                                   ppi$... ssippi$...                    ppi$... ssippi$...


                                                         9:8                 3:2              11:8                5:4




                                                                                                                        .     .   .         .       .     .

           jc-yang (Osaka-U)                                                         SuffixTree                                        November 12, 2011   23 / 25
Time Complexity Analysis
$
i
p
p
i
s
s
i
s
s
i
m
 .   m        i          s   s   i   s     s         i   p       p       i          $
Time complexity is 2N = O(N) .
                                                             .       .       .          .      .     .

     jc-yang (Osaka-U)                   SuffixTree                               November 12, 2011   24 / 25
Experiment – English text
.

                           16
                                                                "data" using 1:3
                           14
Construction Time (sec.)




                           12

                           10

                           8

                           6

                           4

                           2

                           0
                                0      20000 40000   60000 80000 100000 120000 140000
                                                     File size (byte)
                                                                             .     .   .         .       .     .

                           jc-yang (Osaka-U)               SuffixTree                       November 12, 2011   25 / 25

Weitere ähnliche Inhalte

Mehr von Jiachen Yang

データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性Jiachen Yang
 
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...Jiachen Yang
 
チェックリストと分割に基づく 網羅と使用テスト
チェックリストと分割に基づく  網羅と使用テストチェックリストと分割に基づく  網羅と使用テスト
チェックリストと分割に基づく 網羅と使用テストJiachen Yang
 
Active Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly ReportsActive Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly ReportsJiachen Yang
 
Inference and Checking of Object Ownership
Inference  and  Checking  of  Object OwnershipInference  and  Checking  of  Object Ownership
Inference and Checking of Object OwnershipJiachen Yang
 
基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究Jiachen Yang
 
Output fica.beamer.43
Output fica.beamer.43Output fica.beamer.43
Output fica.beamer.43Jiachen Yang
 

Mehr von Jiachen Yang (8)

データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性データモデルの更新を効率よく検証するの並列可能性
データモデルの更新を効率よく検証するの並列可能性
 
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
Slides for Semantic Versioning versus Breaking Changes: A Study of the Maven ...
 
チェックリストと分割に基づく 網羅と使用テスト
チェックリストと分割に基づく  網羅と使用テストチェックリストと分割に基づく  網羅と使用テスト
チェックリストと分割に基づく 網羅と使用テスト
 
Active Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly ReportsActive Refinement of Clone Anomaly Reports
Active Refinement of Clone Anomaly Reports
 
Inference and Checking of Object Ownership
Inference  and  Checking  of  Object OwnershipInference  and  Checking  of  Object Ownership
Inference and Checking of Object Ownership
 
基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究基于OpenNEbula的虚拟化服务器集群中节能 的研究
基于OpenNEbula的虚拟化服务器集群中节能 的研究
 
Output fica.beamer.43
Output fica.beamer.43Output fica.beamer.43
Output fica.beamer.43
 
Cloud sim report
Cloud sim reportCloud sim report
Cloud sim report
 

Kürzlich hochgeladen

PORTAFOLIO 2024_ ANASTASIYA KUDINOVA
PORTAFOLIO   2024_  ANASTASIYA  KUDINOVAPORTAFOLIO   2024_  ANASTASIYA  KUDINOVA
PORTAFOLIO 2024_ ANASTASIYA KUDINOVAAnastasiya Kudinova
 
办理卡尔顿大学毕业证成绩单|购买加拿大文凭证书
办理卡尔顿大学毕业证成绩单|购买加拿大文凭证书办理卡尔顿大学毕业证成绩单|购买加拿大文凭证书
办理卡尔顿大学毕业证成绩单|购买加拿大文凭证书zdzoqco
 
shot list for my tv series two steps back
shot list for my tv series two steps backshot list for my tv series two steps back
shot list for my tv series two steps back17lcow074
 
cda.pptx critical discourse analysis ppt
cda.pptx critical discourse analysis pptcda.pptx critical discourse analysis ppt
cda.pptx critical discourse analysis pptMaryamAfzal41
 
办理(宾州州立毕业证书)美国宾夕法尼亚州立大学毕业证成绩单原版一比一
办理(宾州州立毕业证书)美国宾夕法尼亚州立大学毕业证成绩单原版一比一办理(宾州州立毕业证书)美国宾夕法尼亚州立大学毕业证成绩单原版一比一
办理(宾州州立毕业证书)美国宾夕法尼亚州立大学毕业证成绩单原版一比一F La
 
Business research proposal mcdo.pptxBusiness research proposal mcdo.pptxBusin...
Business research proposal mcdo.pptxBusiness research proposal mcdo.pptxBusin...Business research proposal mcdo.pptxBusiness research proposal mcdo.pptxBusin...
Business research proposal mcdo.pptxBusiness research proposal mcdo.pptxBusin...mrchrns005
 
西北大学毕业证学位证成绩单-怎么样办伪造
西北大学毕业证学位证成绩单-怎么样办伪造西北大学毕业证学位证成绩单-怎么样办伪造
西北大学毕业证学位证成绩单-怎么样办伪造kbdhl05e
 
(办理学位证)埃迪斯科文大学毕业证成绩单原版一比一
(办理学位证)埃迪斯科文大学毕业证成绩单原版一比一(办理学位证)埃迪斯科文大学毕业证成绩单原版一比一
(办理学位证)埃迪斯科文大学毕业证成绩单原版一比一Fi sss
 
Top 10 Modern Web Design Trends for 2025
Top 10 Modern Web Design Trends for 2025Top 10 Modern Web Design Trends for 2025
Top 10 Modern Web Design Trends for 2025Rndexperts
 
办理学位证(NTU证书)新加坡南洋理工大学毕业证成绩单原版一比一
办理学位证(NTU证书)新加坡南洋理工大学毕业证成绩单原版一比一办理学位证(NTU证书)新加坡南洋理工大学毕业证成绩单原版一比一
办理学位证(NTU证书)新加坡南洋理工大学毕业证成绩单原版一比一A SSS
 
原版1:1定制堪培拉大学毕业证(UC毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制堪培拉大学毕业证(UC毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制堪培拉大学毕业证(UC毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制堪培拉大学毕业证(UC毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdf3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdfSwaraliBorhade
 
306MTAMount UCLA University Bachelor's Diploma in Social Media
306MTAMount UCLA University Bachelor's Diploma in Social Media306MTAMount UCLA University Bachelor's Diploma in Social Media
306MTAMount UCLA University Bachelor's Diploma in Social MediaD SSS
 
1比1办理美国北卡罗莱纳州立大学毕业证成绩单pdf电子版制作修改
1比1办理美国北卡罗莱纳州立大学毕业证成绩单pdf电子版制作修改1比1办理美国北卡罗莱纳州立大学毕业证成绩单pdf电子版制作修改
1比1办理美国北卡罗莱纳州立大学毕业证成绩单pdf电子版制作修改yuu sss
 
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,Aginakm1
 
Design Portfolio - 2024 - William Vickery
Design Portfolio - 2024 - William VickeryDesign Portfolio - 2024 - William Vickery
Design Portfolio - 2024 - William VickeryWilliamVickery6
 
DAKSHIN BIHAR GRAMIN BANK: REDEFINING THE DIGITAL BANKING EXPERIENCE WITH A U...
DAKSHIN BIHAR GRAMIN BANK: REDEFINING THE DIGITAL BANKING EXPERIENCE WITH A U...DAKSHIN BIHAR GRAMIN BANK: REDEFINING THE DIGITAL BANKING EXPERIENCE WITH A U...
DAKSHIN BIHAR GRAMIN BANK: REDEFINING THE DIGITAL BANKING EXPERIENCE WITH A U...Rishabh Aryan
 
Design principles on typography in design
Design principles on typography in designDesign principles on typography in design
Design principles on typography in designnooreen17
 
Architecture case study India Habitat Centre, Delhi.pdf
Architecture case study India Habitat Centre, Delhi.pdfArchitecture case study India Habitat Centre, Delhi.pdf
Architecture case study India Habitat Centre, Delhi.pdfSumit Lathwal
 

Kürzlich hochgeladen (20)

PORTAFOLIO 2024_ ANASTASIYA KUDINOVA
PORTAFOLIO   2024_  ANASTASIYA  KUDINOVAPORTAFOLIO   2024_  ANASTASIYA  KUDINOVA
PORTAFOLIO 2024_ ANASTASIYA KUDINOVA
 
办理卡尔顿大学毕业证成绩单|购买加拿大文凭证书
办理卡尔顿大学毕业证成绩单|购买加拿大文凭证书办理卡尔顿大学毕业证成绩单|购买加拿大文凭证书
办理卡尔顿大学毕业证成绩单|购买加拿大文凭证书
 
shot list for my tv series two steps back
shot list for my tv series two steps backshot list for my tv series two steps back
shot list for my tv series two steps back
 
cda.pptx critical discourse analysis ppt
cda.pptx critical discourse analysis pptcda.pptx critical discourse analysis ppt
cda.pptx critical discourse analysis ppt
 
办理(宾州州立毕业证书)美国宾夕法尼亚州立大学毕业证成绩单原版一比一
办理(宾州州立毕业证书)美国宾夕法尼亚州立大学毕业证成绩单原版一比一办理(宾州州立毕业证书)美国宾夕法尼亚州立大学毕业证成绩单原版一比一
办理(宾州州立毕业证书)美国宾夕法尼亚州立大学毕业证成绩单原版一比一
 
Business research proposal mcdo.pptxBusiness research proposal mcdo.pptxBusin...
Business research proposal mcdo.pptxBusiness research proposal mcdo.pptxBusin...Business research proposal mcdo.pptxBusiness research proposal mcdo.pptxBusin...
Business research proposal mcdo.pptxBusiness research proposal mcdo.pptxBusin...
 
西北大学毕业证学位证成绩单-怎么样办伪造
西北大学毕业证学位证成绩单-怎么样办伪造西北大学毕业证学位证成绩单-怎么样办伪造
西北大学毕业证学位证成绩单-怎么样办伪造
 
(办理学位证)埃迪斯科文大学毕业证成绩单原版一比一
(办理学位证)埃迪斯科文大学毕业证成绩单原版一比一(办理学位证)埃迪斯科文大学毕业证成绩单原版一比一
(办理学位证)埃迪斯科文大学毕业证成绩单原版一比一
 
Top 10 Modern Web Design Trends for 2025
Top 10 Modern Web Design Trends for 2025Top 10 Modern Web Design Trends for 2025
Top 10 Modern Web Design Trends for 2025
 
办理学位证(NTU证书)新加坡南洋理工大学毕业证成绩单原版一比一
办理学位证(NTU证书)新加坡南洋理工大学毕业证成绩单原版一比一办理学位证(NTU证书)新加坡南洋理工大学毕业证成绩单原版一比一
办理学位证(NTU证书)新加坡南洋理工大学毕业证成绩单原版一比一
 
原版1:1定制堪培拉大学毕业证(UC毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制堪培拉大学毕业证(UC毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制堪培拉大学毕业证(UC毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制堪培拉大学毕业证(UC毕业证)#文凭成绩单#真实留信学历认证永久存档
 
3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdf3D Printing And Designing Final Report.pdf
3D Printing And Designing Final Report.pdf
 
306MTAMount UCLA University Bachelor's Diploma in Social Media
306MTAMount UCLA University Bachelor's Diploma in Social Media306MTAMount UCLA University Bachelor's Diploma in Social Media
306MTAMount UCLA University Bachelor's Diploma in Social Media
 
1比1办理美国北卡罗莱纳州立大学毕业证成绩单pdf电子版制作修改
1比1办理美国北卡罗莱纳州立大学毕业证成绩单pdf电子版制作修改1比1办理美国北卡罗莱纳州立大学毕业证成绩单pdf电子版制作修改
1比1办理美国北卡罗莱纳州立大学毕业证成绩单pdf电子版制作修改
 
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,
'CASE STUDY OF INDIRA PARYAVARAN BHAVAN DELHI ,
 
Design Portfolio - 2024 - William Vickery
Design Portfolio - 2024 - William VickeryDesign Portfolio - 2024 - William Vickery
Design Portfolio - 2024 - William Vickery
 
DAKSHIN BIHAR GRAMIN BANK: REDEFINING THE DIGITAL BANKING EXPERIENCE WITH A U...
DAKSHIN BIHAR GRAMIN BANK: REDEFINING THE DIGITAL BANKING EXPERIENCE WITH A U...DAKSHIN BIHAR GRAMIN BANK: REDEFINING THE DIGITAL BANKING EXPERIENCE WITH A U...
DAKSHIN BIHAR GRAMIN BANK: REDEFINING THE DIGITAL BANKING EXPERIENCE WITH A U...
 
Design principles on typography in design
Design principles on typography in designDesign principles on typography in design
Design principles on typography in design
 
Architecture case study India Habitat Centre, Delhi.pdf
Architecture case study India Habitat Centre, Delhi.pdfArchitecture case study India Habitat Centre, Delhi.pdf
Architecture case study India Habitat Centre, Delhi.pdf
 
Call Girls in Pratap Nagar, 9953056974 Escort Service
Call Girls in Pratap Nagar,  9953056974 Escort ServiceCall Girls in Pratap Nagar,  9953056974 Escort Service
Call Girls in Pratap Nagar, 9953056974 Escort Service
 

Ukk's Algorithm of Suffix Tree

  • 1. . Algorithm of Suffix Tree by farseerfc@gmail.com . Jiachen Yang1 1 Research Student in Osaka University November 12, 2011 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 1 / 25
  • 2. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 2 / 25
  • 3. What can we do with suffix tree? . Find properties of the strings Linear algorithms for exact string . Find the longest common substrings of the 1 matching string Si and Sj in θ (ni + nj ) time . like KMP . Find all maximal pairs, maximal repeats or 2 Search for strings supermaximal repeats in θ (n + z) time, if there . Check if a string P of length m is 1 are z such repeats . . Find the Lempel-Ziv decomposition in θ (n) 3 a substring in O(m) time. . Find all z occurrences of the 2 time . . Find the longest repeated substrings in θ (n) patterns P1 , · · · , Pq of total 4 length m as substrings in time. . Find the most frequently occurring substrings 5 O(m + z) time. . Search for a regular expression P 3 of a minimum length in θ (n) time. . Find the shortest strings from ∑ that do not 6 in time expected sublinear in n . . Find for each suffix of a pattern 4 occur in D, in O(n + z) time, if there are z such P, the length of the longest strings. . Find the shortest substrings occurring only 7 match between a prefix of P[i . . . m] and a substring in D in once in θ (n) time. . Find, for each i, the shortest substrings of Si θ (m) time . 8 . ... 5 not occurring elsewhere in D in θ (n) time. . ... 9 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 3 / 25
  • 4. Trie, Radix Tree, Suffix Trie & Suffix Tree trie1 is a dictionary tree. (AKA prefix tree) stores a set of words. each node represents a character except that root is empty string. words with common prefix share same parent nodes. minimal deterministic finite automaton that accepts all words. radix tree is a trie with compressed chain of nodes. (AKA patricia trie or radix trie) Each internal node has at least 2 children. suffix trie is a trie which stores all suffix of a given string. suffix tree is a suffix radix tree. that enables linear time construction and fast algorithms of other problems on a string. 1 pronounced as in word retrieval by its inventor, /tri:/ “tree”, but pronounced /traI/ “try” by other authors . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 4 / 25
  • 5. Trie & Radix Tree . Trie of “A to tea ted ten i in inn” . . Radix tree example . . . . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 5 / 25
  • 6. Suffix Trie & Suffix Tree of “banana” ^ b a n $ 0:6 a n $ a banana$ a na $ 1:0 8:6 6:6 10:6 n a n $ na $ na$ $ a n $ a 4:6 9:6 3:2 7:6 n a $ na$ $ a $ 2:1 5:6 $ . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 6 / 25
  • 7. Suffix Tree of “mississippi” mississippi$ 0:11 (1,2) (0,...) (8,9) (11,...) (2,3) i mississi... p $... s 12:8 1:0 15:10 18:11 4:4 (8,...) (2,5) (11,...) (10,...) (9,...) ppi$... ssi $... i$... pi$... (3,5) (4,5) 13:8 6:8 17:11 16:10 14:8 si i (8,...) (5,...) ppi$... ssippi$... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) ppi$... ssippi$... ppi$... ssippi$... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 7 / 25
  • 8. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 8 / 25
  • 9. History of Suffix Tree Algorithms . M. Farach. Optimal suffix tree First linear algorithm was introduced by Weiner construction with large alphabets. In focs, page 1973 as position tree. Awarded by Donald Knuth 137. Published by the IEEE Computer Society, 1997. as “Algorithm of the year 1973”. E. M. McCreight. A space-economical suffix Greatly simplified by McCreight 1976. tree construction algorithm. J. ACM, 23:262–272, April Above two algorithms are processing string backward. 1976. ISSN 0004-5411. doi: http://doi.acm.org/10. 1145/321941.321946. URL First online construction by Ukkonen 1995, which http: //doi.acm.org/10. is easier to understand. 1145/321941.321946. E. Ukkonen. On-line construction of suffix trees. Above algorithms assume size of alphabet as fixed Algorithmica, 14:249–260, 1995. ISSN 0178-4617. constant . URL http://dx.doi.org/10. Limitation was break by Farach 1997, optimal for 1007/BF01206331. 10.1007/BF01206331. all alphabets. P. Weiner. Linear pattern matching algorithms. In Switching and Automata Further study are continued to scale to scenarios when Theory, 1973. SWAT ’08. IEEE Conference Record of the whole suffix tree or even input string cannot fit into 14th Annual Symposium on, pages 1 –11, oct. 1973. doi: memory. . . . 10.1109/SWAT.1973.13. . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 9 / 25
  • 10. Backward Construction of Suffix Tree 1 2 3 4 banana$ banana$ anana$ banana$ anana$ nana$ banana$ ana nana$ na$ $ 7 6 5 banana$ a na $ banana$ a na banana$ ana na na $ na$ $ na $ na$ $ na$ $ na$ $ na$ $ na$ $ . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 10 / 25
  • 11. Construct Suffix Tree by Sorting Suffix Suffix : Sorted suffix : Tree of sorted suffix : mississippi i | -i - >| - ississippi ippi | | - ppi ssissippi issippi | | - ssi - >| - ppi sissippi ississippi | | - ssippi issippi mississippi | - mississippi ssippi pi | -p - >| - i sippi ppi | | - pi ippi sippi | -s - >| -i - - >| - ppi ppi sissippi | | - ssippi pi ssippi | - si - >| - ppi i ssissippi | - ssippi Time complexity will be O(N 2 log N) . Space complexity will be O(N 2 ) . . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 11 / 25
  • 12. Naïve Algorithm S UFFIX T REE(string) 1 for i ← 1 to length(string) 2 do U PDATE(treei ) £ Phrase i U PDATE(treei ) 1 for j ← 1 to i 2 do node ← treei .F IND(suffix[j to i − 1]) 3 E XTEND(node, string[i]) £ Extension j Time complexity will be O(N 3 ) . Space complexity will be O(N 2 ) . The challenge is to make sure treei is updated to treei+1 efficiently. . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 12 / 25
  • 13. Suffix Extend Cases of Naïve Algorithm . Case I If path Sj...i ends at leaf, append a char Si+1 to end of edge into leaf. Sj...i = . . . na Sj...i+1 = . . . nan na nan Case II If path Sj...i ends in the middle of an edge , and next char Si+1 is not equal to the next char in the edge, split that edge, create a internal node, add a new edge to a new leaf. Sj...i = . . . na Sj...i+1 = . . . nay n nan na y Case III If path Sj...i ends in the middle of an edge , and next char Si+1 is equal to the next char in the edge, do nothing, extenstion has done. Sj...i = . . . na Sj...i+1 = . . . nan nan nan . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 13 / 25
  • 14. Naïve Online Construction of Suffix Tree 1 2 3 4 b ba a ban an n bana ana na 7 6 5 banana$ a na $ banana anana nana banan anan nan na $ na$ $ na$ $ . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 14 / 25
  • 15. Properties of Suffix Tree . . Each update will add exactly 1 1 leaf node . 0:6 nr_leaf = N . Suffix tree is full tree. 2 banana$ a na $ Each internal node has at least 2 children. 1:0 8:6 6:6 10:6 nr_internal < N nr_node < 2N na $ na$ $ . Worst case Fabonacci word 3 4:6 9:6 3:2 7:6 abaababaabaab . Suffix is either explicit or implicit. 4 na$ $ Explicit when it ends at a node. 2:1 5:6 Implicit when it ends in the middle of an edge. . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 15 / 25
  • 16. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 16 / 25
  • 17. Optimization of Naïve Algorithm . . Substrings can be represented as (start, end) pair of their index in 1 orignal string. Reduce space complexity to O(N) if size of alphabet is fixed constant. . Once a leaf, Always a leaf 2 Represent edge that links to a leaf as (start, · · · ). Extend leaf nodes for free. We do not need Extend Case I. 0:6 0:6 banana$ a na $ (1,2) (0,...) (6,...) (2,4) a banana$... $... na 1:0 8:6 6:6 10:6 8:6 1:0 10:6 6:6 (6,...) (2,4) (6,...) (4,...) na $ na$ $ $... na $... na$... 4:6 9:6 3:2 7:6 9:6 4:6 7:6 3:2 (6,...) (4,...) na$ $ $... na$... 5:6 2:1 2:1 5:6 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 17 / 25
  • 18. Active Point During a phrase, if we meet Extend Case III, that is if we found S[i + 1] already exists in suffix[j . . . i] then S[i + 1] will exists in ∀suffix[k . . . i], k ∈ j . . . i. Thus Case III is a sign that means update of this phrase is finished. During phrase i if we stopped at suffix[k . . . i] by Case III, then in next phrase we can start from suffix[k . . . i + 1] because all suffix start with 1 . . . k − 1 will end at Case I. We called this point(current internal node, current position k in string) as Active Point. a ab aba abaa abaab abaaba abaabab abaababa abaababaa abaababaab abaababaaba abaababaabaa abaababaaba ab 0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9 0:10 0:11 0:12 (0,...) (0,...) (1,...) (0,...) (1,...) (0,1) (1,...) (0,1) (1,...) (0,1) (1,...) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) (0,1) (1,3) a... ab... b... aba... ba... a baa... a baab... a baaba... a ba a ba a ba a ba a ba a ba a ba 1:0 1:0 2:1 1:0 2:1 3:3 2:1 3:3 2:1 3:3 2:1 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 3:3 7:6 (3,...) (1,...) (3,...) (1,...) (3,...) (1,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,...) (1,3) (3,...) (6,...) (3,6) (1,3) (3,6) (6,...) (3,6) (1,3) (3,6) (6,...) a... baa... ab... baab... aba... baaba... abab... ba abab... b... ababa... ba ababa... ba... ababaa... ba ababaa... baa... ababaab... ba ababaab... baab... ababaaba... ba ababaaba... baaba... aba ba aba baabaa... aba ba aba baabaab... 4:3 1:0 4:3 1:0 4:3 1:0 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 4:3 5:6 2:1 8:6 13:11 5:6 11:11 8:6 13:11 5:6 11:11 8:6 (3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (3,...) (6,...) (11,...) (6,...) (3,6) (6,...) (11,...) (6,...) (11,...) (6,...) (3,6) (6,...) (11,...) (6,...) abab... b... ababa... ba... ababaa... baa... ababaab... baab... ababaaba... baaba... a... baabaa... aba baabaa... a... baabaa... ab... baabaab... aba baabaab... ab... baabaab... 1:0 6:6 1:0 6:6 1:0 6:6 1:0 6:6 1:0 6:6 14:11 4:3 9:11 6:6 12:11 2:1 14:11 4:3 9:11 6:6 12:11 2:1 (11,...) (6,...) (11,...) (6,...) a... baabaa... ab... baabaab... 10:11 1:0 10:11 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 18 / 25
  • 19. Ukk’s Update using Active Point U PDATE(treei ) 1 current_suffix ← active_point 2 next_char ← string[i] 3 while True 4 do if there exists edge start with next_char 5 then break £ Case III 6 else 7 split current edge if implicit 8 create new leaf with new edge labelled next_char 9 if current_suffix is empty 10 then break 11 else current_suffix ← next shorter suffix 12 active_point ← current_suffix . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 19 / 25
  • 20. Suffix Link to find next shorter suffix Suffix link Internal node of suffix X α has a link to node α . If α is empty, suffix link points to root. How to create suffix link Link together every internal nodes that are created by splitting in same phrase. banana$ banana$ 0:6 0:6 (1,2) (0,...) (6,...) (2,4) (0,...) (1,2) (6,...) (2,4) a banana$... $... na banana$... a $... na 8:6 1:0 10:6 6:6 1:0 8:6 10:6 6:6 (6,...) (2,4) (6,...) (4,...) (6,...) (2,4) (6,...) (4,...) $... na $... na$... $... na $... na$... 9:6 4:6 7:6 3:2 9:6 4:6 7:6 3:2 (6,...) (4,...) (6,...) (4,...) $... na$... $... na$... 5:6 2:1 5:6 2:1 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 20 / 25
  • 21. Fast jump using Suffix Link Assume we are at Suffix X αβ , whose parent internal node represent X α . 1. Go back to parent internal node, 2. Jumping follow the node’s suffix link to the node represent Suffix α 3. Go down to Suffix αβ . Even jump down in step 3 because we already know length of β . (Skip/Count trick) Combining all these tricks we can Extend a character in O(1) time. . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 21 / 25
  • 22. Part Outline . 1 What is Suffix Tree . 2 History & Naïve Algorithm . 3 Optimization of Naïve Algorithm . 4 Examples & Analysis . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 22 / 25
  • 23. Experiment – mississippi . m 0:0 (0,...) m... 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 24. Experiment – mississippi . mi 0:1 (1,...) (0,...) i... mi... 2:1 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 25. Experiment – mississippi . mis 0:2 (1,...) (2,...) (0,...) is... s... mis... 2:1 3:2 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 26. Experiment – mississippi . miss 0:3 (1,...) (2,...) (0,...) iss... ss... miss... 2:1 3:2 1:0 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 27. Experiment – mississippi . miss i 0:4 (1,...) (0,...) (2,3) issi... missi... s 2:1 1:0 4:4 (4,...) (3,...) i... si... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 28. Experiment – mississippi . miss is 0:5 (1,...) (0,...) (2,3) issis... missis... s 2:1 1:0 4:4 (4,...) (3,...) is... sis... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 29. Experiment – mississippi . miss iss 0:6 (1,...) (0,...) (2,3) ississ... mississ... s 2:1 1:0 4:4 (4,...) (3,...) iss... siss... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 30. Experiment – mississippi . mississi 0:7 (1,...) (0,...) (2,3) ississi... mississi... s 2:1 1:0 4:4 (4,...) (3,...) issi... sissi... 5:4 3:2 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 31. Experiment – mississippi . mississip 0:8 (8,...) (1,2) (0,...) (2,3) p... i mississi... s 14:8 12:8 1:0 4:4 (8,...) (2,5) p... ssi (3,5) (4,5) 13:8 6:8 si i (8,...) (5,...) p... ssip... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) p... ssip... p... ssip... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 32. Experiment – mississippi . mississip p 0:9 (8,...) (1,2) (0,...) (2,3) pp... i mississi... s 14:8 12:8 1:0 4:4 (8,...) (2,5) pp... ssi (3,5) (4,5) 13:8 6:8 si i (8,...) (5,...) pp... ssipp... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) pp... ssipp... pp... ssipp... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 33. Experiment – mississippi . mississippi 0:10 (1,2) (0,...) (8,9) (2,3) i mississi... p s 12:8 1:0 15:10 4:4 (2,5) (8,...) (9,...) (10,...) ssi ppi... pi... i... (3,5) (4,5) 6:8 13:8 14:8 16:10 si i (8,...) (5,...) ppi... ssippi... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) ppi... ssippi... ppi... ssippi... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 34. Experiment – mississippi . mississippi$ 0:11 (1,2) (0,...) (8,9) (11,...) (2,3) i mississi... p $... s 12:8 1:0 15:10 18:11 4:4 (8,...) (2,5) (11,...) (10,...) (9,...) ppi$... ssi $... i$... pi$... (3,5) (4,5) 13:8 6:8 17:11 16:10 14:8 si i (8,...) (5,...) ppi$... ssippi$... 7:8 2:1 8:8 10:8 (8,...) (5,...) (8,...) (5,...) ppi$... ssippi$... ppi$... ssippi$... 9:8 3:2 11:8 5:4 . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 23 / 25
  • 35. Time Complexity Analysis $ i p p i s s i s s i m . m i s s i s s i p p i $ Time complexity is 2N = O(N) . . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 24 / 25
  • 36. Experiment – English text . 16 "data" using 1:3 14 Construction Time (sec.) 12 10 8 6 4 2 0 0 20000 40000 60000 80000 100000 120000 140000 File size (byte) . . . . . . jc-yang (Osaka-U) SuffixTree November 12, 2011 25 / 25