Aho Corasick vs Suffix Trees

sdssudhu's blog

By sdssudhu, history, 6 years ago, In English

I was recently reading about Aho-Corasick and suffix trees. It seems to me that the operations a suffix tree supports are a superset of the operations Aho-Corasick supports. Is that assumption correct, or can Aho-Corasick handle some kind of query or operation that a suffix tree cannot?

aho-corasick, suffix tree

Comments (7)

LLI_E_P_JI_O_K

6 years ago, # |

A suffix tree contains the suffixes of a single string, whereas an Aho-Corasick trie contains multiple (possibly different) strings. Its suffix links and related techniques let you solve kinds of problems that do not seem easy to solve with a suffix tree alone. For example: count the integers in [A...B], 1 <= A <= B <= 1e1000, that do not contain any of S1, S2, S3, ..., SN as a substring (S1...SN consist of decimal digits).
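To give a feel for how the automaton helps with this kind of counting, here is a minimal sketch of a simplified variant: it counts digit strings of a fixed length n (leading zeros allowed) that avoid all patterns, modulo 1e9+7. The function name `count_avoiding`, the modulus, and the fixed-length simplification are assumptions made for illustration; the real [A...B] problem additionally needs the usual "tight bound" handling of digit DP, which is left out here.

```cpp
#include <array>
#include <queue>
#include <string>
#include <vector>

// Counts digit strings of length n (leading zeros allowed) that contain none
// of the patterns as a substring, modulo 1e9+7. Patterns must consist of '0'..'9'.
long long count_avoiding(const std::vector<std::string>& patterns, int n) {
    const long long MOD = 1000000007LL;
    std::vector<std::array<int, 10>> next(1);  // trie edges, later full automaton
    next[0].fill(-1);
    std::vector<int> link(1, 0);               // suffix links
    std::vector<char> bad(1, 0);               // 1 if some pattern ends in this state

    // Insert all patterns into the trie.
    for (const std::string& p : patterns) {
        int v = 0;
        for (char ch : p) {
            int c = ch - '0';
            if (next[v][c] == -1) {
                next[v][c] = static_cast<int>(next.size());
                std::array<int, 10> fresh;
                fresh.fill(-1);
                next.push_back(fresh);
                link.push_back(0);
                bad.push_back(0);
            }
            v = next[v][c];
        }
        bad[v] = 1;
    }

    // BFS: compute suffix links, fill missing transitions, propagate "bad".
    std::queue<int> q;
    for (int c = 0; c < 10; ++c) {
        if (next[0][c] == -1) next[0][c] = 0;
        else q.push(next[0][c]);
    }
    while (!q.empty()) {
        int v = q.front();
        q.pop();
        bad[v] = bad[v] || bad[link[v]];  // a pattern ending in a suffix also counts
        for (int c = 0; c < 10; ++c) {
            int u = next[v][c];
            if (u == -1) next[v][c] = next[link[v]][c];
            else { link[u] = next[link[v]][c]; q.push(u); }
        }
    }

    // dp[v] = number of strings built so far that end in state v and never hit a bad state.
    std::vector<long long> dp(next.size(), 0), nd;
    if (!bad[0]) dp[0] = 1;
    for (int step = 0; step < n; ++step) {
        nd.assign(next.size(), 0);
        for (int v = 0; v < static_cast<int>(next.size()); ++v) {
            if (dp[v] == 0) continue;
            for (int c = 0; c < 10; ++c) {
                int u = next[v][c];
                if (!bad[u]) nd[u] = (nd[u] + dp[v]) % MOD;
            }
        }
        dp.swap(nd);
    }
    long long total = 0;
    for (long long x : dp) total = (total + x) % MOD;
    return total;
}
```

For instance, `count_avoiding({"00"}, 2)` counts the two-digit strings other than "00". The state of the DP is just the current automaton vertex, which is what makes the pattern set cheap to carry along.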

sdssudhu

6 years ago, # ^ |

OK, I get it. So Aho-Corasick is useful when there are many input strings.

Jakube

6 years ago, # ^ |

A suffix tree and Aho-Corasick are based on different techniques. They can do similar things, such as finding all occurrences of a set of patterns in a given text.

The main difference is that you build a suffix tree over the text and then look up the patterns in the tree. In contrast, you build the Aho-Corasick automaton over the patterns and then feed the characters of the text through the trie, finding all occurrences that end at the current position. Depending on your needs, one of the structures can have an advantage over the other.
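The "index the text, then query patterns" side can be sketched as follows. Note that this swaps in a naively built suffix array instead of a suffix tree, purely to keep the example short; the function names `build_suffix_array` and `occurrences` are made up for this illustration, and the construction here is far from linear.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Start positions of all suffixes of text in lexicographic order.
// Naive construction by direct comparison, fine for small inputs only.
std::vector<int> build_suffix_array(const std::string& text) {
    std::vector<int> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// All start positions at which pattern occurs in text, in increasing order.
// Suffixes that start with pattern form one contiguous range of sa.
std::vector<int> occurrences(const std::string& text, const std::vector<int>& sa,
                             const std::string& pattern) {
    auto lo = std::lower_bound(sa.begin(), sa.end(), pattern,
        [&](int start, const std::string& p) {
            return text.compare(start, p.size(), p) < 0;
        });
    auto hi = std::upper_bound(lo, sa.end(), pattern,
        [&](const std::string& p, int start) {
            return text.compare(start, p.size(), p) > 0;
        });
    std::vector<int> result(lo, hi);
    std::sort(result.begin(), result.end());
    return result;
}
```

The index is built once from the text; each pattern is then answered by two binary searches, independently of the other patterns. With Aho-Corasick the roles are reversed: the patterns are preprocessed and the text is streamed through.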

The structures can also be used for other tasks. However, since one structure encodes just one string and the other a set of strings, each of them can solve some tasks efficiently that the other cannot. E.g. the suffix tree can be used to compute the longest palindrome in the text, or the longest common substring. Or you can use Aho-Corasick to find the shortest string that contains all patterns.

As a final note, if you create an Aho-Corasick trie with just one pattern, the trie is a path with some additional suffix links. It is exactly the same as the graph induced by the prefix function (the one used in KMP), and you can see Aho-Corasick as an extension of the prefix function to multiple strings. If you know the prefix function, you will understand Aho-Corasick very easily; it will almost feel trivial. Suffix trees, however (especially their linear-time construction), are much harder to understand.
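For reference, a minimal sketch of the prefix function in its usual form. The name `prefix_function` is just the conventional one, not taken from any particular library.

```cpp
#include <string>
#include <vector>

// pi[i] = length of the longest proper prefix of s[0..i] that is also a suffix of s[0..i].
std::vector<int> prefix_function(const std::string& s) {
    int n = static_cast<int>(s.size());
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; ++i) {
        int k = pi[i - 1];
        // Fall back along shorter borders until s[i] can extend one of them.
        while (k > 0 && s[i] != s[k]) k = pi[k - 1];
        if (s[i] == s[k]) ++k;
        pi[i] = k;
    }
    return pi;
}
```

In the single-pattern trie, the vertex for the prefix of length i+1 has its suffix link pointing to the vertex for the prefix of length pi[i]. The `while` loop above is the same "follow links until a transition exists" step that Aho-Corasick performs on a trie instead of a path.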

sdssudhu

6 years ago, # ^ |

Ok. I sort of get the differences between the 2 structures.

One more thing: I read about Aho-Corasick on cp-algorithms. For finding all strings from a given set in a text, the article mentions a concept called an exit link. I couldn't understand that part properly. Could you explain it, or share code for it if you have some? That would help me.

Thanks

Jakube

6 years ago, # ^ |

Actually, I wrote that article on cp-algorithms (well, I translated it from e-maxx.ru). I'll try to add a few nice pictures and an extended explanation to the article tomorrow.

For now, just something very short: during the construction of the trie you have marked all vertices that correspond to the end of a pattern and stored which patterns end at each vertex. Suppose you are currently at vertex v (e.g. after processing the first x characters of your text) and want to know all patterns that end exactly at this position. You can find them as follows:

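// suffix_link[] and patterns_ending_at[] are assumed to be filled during construction.
// u walks up the suffix links; v itself stays untouched so the scan of the text can continue.
for (int u = v; u != root; u = suffix_link[u]) {
    for (int id : patterns_ending_at[u])  // empty if no pattern ends at u
        print(id);
}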

This prints every pattern that ends at the current position: such a pattern is a suffix of the text processed so far, and following the suffix links visits exactly those suffixes that are present in the trie, so every possible occurrence gets checked.

Take a look at the following Aho-Corasick trie for the patterns aaa, aab, and a (red marks the vertices where a pattern ends, blue arrows are suffix links):

[Image: Aho-Corasick trie for the patterns aaa, aab, and a]

Suppose you have just processed the text xyzaaa and are at vertex 3. You print the pattern aaa, follow the link to 2, follow the link to 1 and print a, then follow the link to 0. Notice that we visited 2 but accomplished nothing there, since no pattern ends at that vertex. Exit links are just suffix links that skip such redundant vertices: the exit link from 3 points directly to 1. You are therefore guaranteed to find all ans patterns that end at that position in the text in O(ans) time.
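Put together, the idea could look roughly like the sketch below. It is not the cp-algorithms code: the struct layout, the member names (`link`, `exit`, `pattern`), and the restriction to distinct lowercase patterns and lowercase text are all assumptions made to keep the example small.

```cpp
#include <array>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct AhoCorasick {
    static constexpr int ALPHA = 26;  // assumes characters 'a'..'z' only
    struct Node {
        std::array<int, ALPHA> next;
        int link = 0;      // suffix link
        int exit = 0;      // nearest vertex on the suffix-link path where a pattern ends (0 = none)
        int pattern = -1;  // id of the pattern ending here, -1 if none (patterns assumed distinct)
        Node() { next.fill(-1); }
    };
    std::vector<Node> t;

    AhoCorasick() : t(1) {}  // vertex 0 is the root

    void add(const std::string& s, int id) {
        int v = 0;
        for (char ch : s) {
            int c = ch - 'a';
            if (t[v].next[c] == -1) {
                t[v].next[c] = static_cast<int>(t.size());
                t.emplace_back();
            }
            v = t[v].next[c];
        }
        t[v].pattern = id;
    }

    // BFS over the trie: computes suffix links, exit links and missing transitions.
    void build() {
        std::queue<int> q;
        for (int c = 0; c < ALPHA; ++c) {
            int u = t[0].next[c];
            if (u == -1) t[0].next[c] = 0;
            else q.push(u);  // depth-1 vertices keep link = 0
        }
        while (!q.empty()) {
            int v = q.front();
            q.pop();
            int l = t[v].link;
            // Either the suffix-link target itself ends a pattern, or we reuse its exit link.
            t[v].exit = (t[l].pattern != -1) ? l : t[l].exit;
            for (int c = 0; c < ALPHA; ++c) {
                int u = t[v].next[c];
                if (u == -1) t[v].next[c] = t[l].next[c];
                else { t[u].link = t[l].next[c]; q.push(u); }
            }
        }
    }

    // Returns (end position in text, pattern id) for every occurrence.
    std::vector<std::pair<int, int>> find_all(const std::string& text) const {
        std::vector<std::pair<int, int>> out;
        int v = 0;
        for (int i = 0; i < static_cast<int>(text.size()); ++i) {
            v = t[v].next[text[i] - 'a'];
            // Start at v if a pattern ends there, then hop along exit links only.
            for (int u = (t[v].pattern != -1 ? v : t[v].exit); u != 0; u = t[u].exit)
                out.emplace_back(i, t[u].pattern);
        }
        return out;
    }
};
```

Adding aaa, aab and a in that order (with ids 0, 1, 2) happens to reproduce the vertex numbering used above: vertex 3 gets suffix link 2 and exit link 1. Because every hop along an exit link reports a pattern, the inner loop does work proportional to the number of occurrences ending at that position.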

Jakube

6 years ago, # ^ |

My implementation of Aho-Corasick is here: Github. Note that I don't compute the array go on the fly; instead, I do a BFS after adding all strings, during which I generate all suffix and exit links (called next_terminal there).

For reference, I learned the BFS approach from here.

sdssudhu

6 years ago, # ^ |

Thanks for taking the time for the explanation. I will go through these.
