err0r's blog

By err0r, history, 7 years ago, In English

Recently i learnt Aho-Corasick Algorithm for Pattern Searching. To find all occurrences of pattern strings in text string. i store all possible output links in my prefix tree(trie)

But case such as
text = aaaaa
pattern = {a, aa, aaa, aaaa, aaaaa}
will give me huge number of output link!

my implementation is there a better way to store them?

  • Vote: I like it
  • +8
  • Vote: I do not like it

»
7 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by err0r (previous revision, new revision, compare).

»
7 years ago, # |
  Vote: I like it +3 Vote: I do not like it

Store in a node only the indices of the pattern, that match exactly (the letters of the path from the root to this node spell exactly the pattern).

In this way you only store at most one pattern for each node (except if there are duplicate patterns).

All other pattern that match in a certain node, are suffixes of the exact matching pattern. Therefore if you want to find all matches that end in a certain node, you can just walk along the suffix links and print all the stored indices that you find along these nodes.

Notice, that this can be quite inefficient. E.g. if you have the patterns {a, aaaaaaaaaaaa}. There are 13 nodes in the trie, and to find all matches that end at the end of the text aaaaaaaaaaaa you would have to visit all 13 nodes, although there are only 2 matches.

One solution. You can precompute a second suffix link for each node, which points to the next terminal node (if there exists such a node) on the suffix link path. Then you can just print the (possible empty) output vector, go to the next terminal node via the link, print the (guaranteed non-empty) output vector, etc. This uses only O(|nodes|) more memory, O(|nodes|) more precomputation time, and finding all matches ending in a certain node is still O(|matches|).

You can see it in my implementaton. Notice that I assume that there are no duplicate patterns, so it only stores one possible pattern index for each node.