I was recently reading about aho corasick and suffix trees. I felt that suffix tree operations are a superset of aho corasick operations. Is my assumption correct or can aho corasick perform some kind of query/operation that cannot be performed by suffix trees.
Suffix tree contains the suffixes of 1 string, Aho–Corasick trie contains multiple (perhaps different) strings and all suffix links and other techniques allow to solve different kind of problems that seems not to be solved easily with suffix tree only, for example, find the number of integers in [A...B], 1<=A<=B<=1e1000 such as they don't contain S1,S2,S3,...,SN as substrings (S1...SN consist of decimal digits).
Ok. I get it. So aho-corasick is useful when there are plenty of input strings.
A suffix tree and Aho-Corasick are based on different techniques. They can do similar stuff, like finding all appearances of a set of patterns in a given text.
The main difference is, that you create a suffix tree over the text, and then find appearances of patterns in the tree. In contrary you create the Aho-Corasick for the patterns, and then iterate over the chars in the text over the trie and you find all appearances that end at the current position. Depending on your need one of the structures can have an advantage over the other.
The structures can also be used to compute other tasks. However since one structure encodes just one string, and the other one a set of strings, each of the structures can compute tasks efficiently that the other structures cannot. E.g. the suffix tree can be used to compute the longest palindrome in the text, or the longest common substring. Or you can use Aho-Corasick to find the shortest string that contains all patterns in it.
As a final note, if you create an Aho-Corasick trie with just one pattern, the structure of the trie is a path with some additional suffix links. It is exactly the same as the graph induced by the Prefix algorithm. And you can see the Aho-Corasick as an extension to the Prefix function to work with multiple strings. If you know the Prefix function, you will understand Aho-Corasick very easily, it will almost feel trivial. Suffix trees however (especially their linear construction) are much harder to understand.
Ok. I sort of get the differences between the 2 structures.
One more thing is that I read about aho-corasick from cp-algorithms. In that for finding all strings from a given set in a text they have mentioned something about a concept called exit link. I couldn't understand that part properly. Can you explain it or even if you have a code for it that would help me.
Thanks
Actually I wrote that article on cp-algorithms. (Well, I translated it from e-maxx.ru). I'll try to add a few nice pictures and an extended explanation in the article tomorrow.
For now only something very short: During the construction of the trie you have marked all vertices that correspond to the end of a pattern and stored which patterns end in each vertex. Currently you are at vertex v (e.g. after processing the first x characters of your text). And you want to know all patterns end exactly at this position. You can find them by doing the following:
It should be obvious that this prints all patterns. A pattern that appears at that location is a suffix of the text and by following suffix links we check exactly all possible occurences.
Take a look at the following Aho-Corasick trie for the pattern
aaa
,aab
, anda
(red are positions where pattern end, blue are suffix links):You currently processed the text
xyzaaa
and are at vertex3
. You print the patternaaa
, follow the link to2
, follow the link to1
and printa
and follow the link to0
. Notice that we visited2
but actually accomplished nothing at that location, since no pattern ends at that vertex. Exit links are just suffix links that skip such redundant vertices. The exit link from3
would point directly to1
. And therefore you are guaranteed to find allans
pattern that end at that position in the text inO(ans)
time.My implementation of Aho-Corasick is here: Github Notice, I don't compute the array
go
on the fly, but instead do a BFS after adding all strings, during which I generate all suffix and exit links (here callednext_terminal
).For reference, I learned the BFS approach from here.
Thanks for taking the time for the explanation. I will go through these.