Prologue:
OK, so I finally decided to learn the Suffix Tree. It's a monster I had never dared to look in the eye; until now I had always gotten away with using a Suffix Array instead, tail between my legs. The thing is that some problems require a more efficient data structure, so I decided to leave my fears behind and learn it once and for all.
First of all, I read about it in a few books and some papers online. Once I understood how its final structure works and how to work with it, I searched for an implementation online (no, I didn't plan on coding everything from scratch). I found a good, efficient implementation on GeeksforGeeks, but unfortunately it was written in C and worked with native types.
The tweaking and testing:
I worked with the code and tweaked it here and there. I removed unnecessary comments and upgraded the code to C++ so that it works with strings instead of char pointers. I also added some extra functionality that didn't originally come with the implementation (_Longest Common Substring_, _Longest Repeated Substring_, _Occurrence Count_, etc.). I finally wrapped everything into a class and went on to testing.
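Roughly speaking, the interface ended up looking something like this (the names below are only illustrative, not the actual GeeksforGeeks code):

```cpp
#include <string>

// Rough shape of the wrapper class (illustrative names, not the real code):
class SuffixTree {
public:
    explicit SuffixTree(const std::string& text);              // build the tree from text
    bool contains(const std::string& pattern) const;           // is pattern a substring?
    long long countOccurrences(const std::string& pattern) const;
    std::string longestRepeatedSubstring() const;
    std::string longestCommonSubstring(const std::string& other) const;
};
```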
Everything worked fine, and it was very fast too. I solved a few problems, but then I came across some others that I couldn't solve, because I realized that the class I had in my hands only worked with a single string. A lot of problems that are solved with Suffix Trees require a different kind of data structure, one that supports multiple strings. That data structure, I learned, is called a Generalized Suffix Tree. This is where the problems began.
Generalized Suffix Tree:
I searched online about it, but all the papers and articles I found built it by adding a different delimiter character for each word. The problem with that approach is that it only works with a very small number of words. A lot of problems give you 500, 100, or even 10^4 strings, and there simply aren't that many spare characters.
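Just to make the limitation concrete, this is roughly what that construction looks like (hypothetical code, not taken from any of those articles): every word gets its own sentinel character, so you run out of fresh characters almost immediately.

```cpp
#include <string>
#include <vector>

// Sketch of the "one unique delimiter per word" construction. With 8-bit
// characters there are only a couple hundred values that never appear in the
// words themselves, so this caps the number of strings you can handle.
std::string concatWithUniqueSentinels(const std::vector<std::string>& words) {
    std::string s;
    unsigned char sentinel = 1;                    // assumed not to occur in the words
    for (const auto& w : words) {
        s += w;
        s += static_cast<char>(sentinel++);        // a different delimiter for each word;
                                                   // wraps/collides past ~250 words
    }
    return s;
}
```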
My idea was to use a single delimiter character for every word, and then do a DFS, counting how many delimiters I pass along the way. When I reach a leaf, the number of delimiters I've passed tells me which word that suffix belongs to.
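Equivalently (assuming each leaf stores the start index of its suffix in the concatenation), the word a suffix belongs to is just the number of delimiters that appear before its start position, which can be precomputed once. A small illustrative helper:

```cpp
#include <string>
#include <vector>

// wordOf[i] = index of the word that position i of the concatenation lies in,
// i.e. the number of delimiters strictly before position i. (Illustrative
// helper, assuming a single delimiter character separates the words.)
std::vector<int> buildWordOf(const std::string& concat, char delim) {
    std::vector<int> wordOf(concat.size());
    int delimsSeen = 0;
    for (int i = 0; i < (int)concat.size(); ++i) {
        wordOf[i] = delimsSeen;                // position i belongs to word #delimsSeen
        if (concat[i] == delim) ++delimsSeen;
    }
    return wordOf;
}
```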
So the problem boils down to this: we have a tree with numbers written on the leaves. How do we efficiently compute, for every inner node, how many different numbers appear on the leaves of its subtree?
My workaround (the best idea that crossed my mind) was to do a DFS and maintain, for each node, a set of the numbers found in its subtree. By merging smaller sets into bigger ones and swapping pointers instead of moving everything from one place to another, I was able to achieve something along the lines of O(N log^2 N). The problem is that Suffix Array + LCP + Range Query ends up being faster than this Generalized Suffix Tree, which isn't very nice after all the effort I put into it. But I still have hope; I think there has to be a better way out there.
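For reference, here is a minimal sketch of that small-to-large merging on a generic rooted tree (the node and array names are mine, not from the suffix tree code): each leaf carries a word label, and for every node we want the number of distinct labels in its subtree.

```cpp
#include <bits/stdc++.h>
using namespace std;

vector<vector<int>> adj;     // children of each node
vector<int> wordLabel;       // label at a leaf, -1 for internal nodes
vector<int> distinctCnt;     // answer: distinct labels in each node's subtree

set<int>* dfs(int v) {
    set<int>* s = new set<int>();
    if (wordLabel[v] != -1) s->insert(wordLabel[v]);
    for (int u : adj[v]) {
        set<int>* t = dfs(u);
        if (t->size() > s->size()) swap(s, t);   // keep the bigger set, swap pointers
        s->insert(t->begin(), t->end());         // merge the smaller set into it
        delete t;
    }
    distinctCnt[v] = (int)s->size();
    return s;
}

int main() {
    // Tiny hand-built example: root 0 with three leaves labelled 0, 0 and 1.
    int n = 4;
    adj.assign(n, {});
    wordLabel = {-1, 0, 0, 1};
    distinctCnt.assign(n, 0);
    adj[0] = {1, 2, 3};
    delete dfs(0);
    cout << distinctCnt[0] << "\n";              // prints 2
}
```

Each label only moves when its set is merged into one at least as large, so it moves O(log N) times, and every move is a set insertion costing O(log N), which is where the O(N log^2 N) comes from.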
Final question:
How do you guys implement a Generalized Suffix Tree, and how do you efficiently compute, for every inner node, how many different words its substring belongs to?
Thanks very much in advance.