adamant's blog

By adamant, 10 years ago, translation, In English

First part, I guess. Even if you think that you are familiar with suffix tree, please, take a look at the code below. It may be interesting to you.

Hi everyone! Finally I learnt this one :)

In this entry I would like to avoid long and complex theory which scared me from suffix tree for a long time. So straight to the point. I will not prove algorithm if you want some proofs, you may check stackoverflow or Dan Gusfield's book... Or somewhere else, I don't know.

Suffix tree is a compressed suffix trie, so all vertices which are not corresponding to suffixes and which have only one descendant are omitted.

Now about the algorithm. At each iteration, it makes implicit suffix tree. In implicit suffix tree all vertices which have only one descendant are omitted. Usually edges are stored as a pair of [Li, Ri]. Personally I am not very convenient to work with them in this way, so I suggest to store in each node some data corresponding to the edge from its ancestor to it — fposi & which is the left position of first edge occurence in the string and leni which is the length of the edge. In this case, the length of the edges, leading to leaves will by default be considered equal to inf. So we can be sure that at any time the edges to the leaves are correct. Root of the tree will be the vertex numbered 0.

Let's at each step of the algorithm keep the longest non-unique suffix of the string. To do this, let's keep a pair of numbers (node, pos) & mdash; vertex in the suffix tree and the number of characters that you need to pass down from it to have this suffix. By default node = pos = 0. When you append a new symbol, let's increase pos by 1 and add all of new unique suffix of the string that appear after adding a new character.

Also, we need the concept of suffix links. It is defined for internal nodes of the tree. Following suffix link will lead to the vertex corresponding to the same substring, but without first character. For the root vertex suffix link is not defined.

Appending of new character consists of the following stages:

  1. If pos = 0, then all suffixes are added. Return. Otherwise let's find the vertex after which new suffix will be added (it is not neccessarily node because edge from node may be too short). So while pos greater then edge from node let's follow this edge and substract its length from pos.
  2. Now let's try to add new suffix. We will have three options here:
    1. If we do not have needed outgoing edges at this node, we simply create a new vertex and hung it to the current one.
    2. If there is an edge and a suffix that we want to add lies entirely on it then this and further suffixes are not unique. Return.
    3. If there is an edge and suffix doesn't lie entirely on it then it differs in only one character, this means that we need to create a new vertex in the middle of the edge and then create another one new vertex (which will be new suffix) and hung it to the vertex in the middle of splitted edge.
  3. If you have not returned on the previous step, go to the next suffix. If node is root, then we reduce the pos with 1, otherwise we just follow the suffix link node = link(node) without changing pos. After that, we go to step 1.

And about siffix links. On the i-th step we will set suffix link of internal vertex created on i - 1-th step. If we create a new internal vertex (i.e. split some edge), then the suffix link will lead into it. In two other cases, the suffix link will lead to the node (I am too lazy to write truly marvellous proof of this, so it is left to the curious reader as an exercise).

And, finally the implementation.

Code: #sT8Vd1

I tried to make the code as simple and clear as possible :) I do not know if I managed to do so and hope for your feedback and questions.

  • Vote: I like it
  • +104
  • Vote: I do not like it

| Write comment?
»
10 years ago, # |
  Vote: I like it +11 Vote: I do not like it

Wow :) That's very short code. Thanks.

»
10 years ago, # |
  Vote: I like it 0 Vote: I do not like it

If you use unordered_map instead of map the complexity will be O(N), which is one of the crucial parts of the algorithm :)

  • »
    »
    10 years ago, # ^ |
      Vote: I like it +5 Vote: I do not like it

    From my experience if you need huge amount of associative containers, usual map is preferred. Also usual map is more useful when you need to traverse tree in lexicographical order.

    • »
      »
      »
      10 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Yes, you are right. And because the alphabet size is constant, using map is also constant. But you could use only one unordered_map<pair<int,int>,int> like this and have a guaranteed constant time data access. As you said though traversing the tree wouldn't be as good as it is with map.

»
9 years ago, # |
  Vote: I like it -16 Vote: I do not like it

Could you please also write a small function to search if a string exists as a substring in the original string ?

  • »
    »
    9 years ago, # ^ |
      Vote: I like it -17 Vote: I do not like it

    This is related to an on-going contest :|

    • »
      »
      »
      9 years ago, # ^ |
        Vote: I like it +47 Vote: I do not like it

      I'm so tired of things like that. Someone asks for some really well-known stuff and then someone else tells us that OMG NOOO THIS IS RELATED TO ON-GOING CONTEST.

      No, really. It is the problem of contest-maker to make some non-trivial problems so such well-known tricks won't be helpful.

»
8 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Can you suggest some problems on suffix trees. ?

»
7 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Hey can you please explain why you are updating link[last] to node in line number 49 of your code.

»
7 years ago, # |
  Vote: I like it -9 Vote: I do not like it

Is palindromic tree a special type of suffix tree?