Understanding Suffix Automaton in depth

I am having a lot of trouble understanding the suffix automaton in all its details.

I have solved about 5 problems that involved very basic applications of it, and I am stuck at some points.

I can't really understand how suffix links work, or what the congruence classes are. I know that the best resource is the e-maxx site, but I don't understand Russian, and the translators (Google, Yandex, etc.) do a poor job: the proofs come out grammatically wrong and I can't follow them in depth. Other resources, like Crochemore's books, are nice but long, and they go beyond what an ACMer needs to understand a data structure.

So it would be very nice if someone could explain suffix links and congruence classes to me.

Also, lots of people calculate something called the right array in their code, and their comments are in either Russian or Chinese, so I can't really understand what those 3 loops do or why they are there.

I hope that someone can really explain all that.

for (int i = 0,p = S; i < n; ++i)
	++r[p = ch[p][s[i] - 'a']];
static int b[N],t[N];
for (int i = 1; i <= cnt; ++i) ++b[l[i]];
for (int i = 1; i <= n; ++i) b[i] += b[i - 1];
for (int i = 1; i <= cnt; ++i) t[b[l[i]]--] = i;
for (int i = cnt; i; --i) r[fa[t[i]]] += r[t[i]];

int occur[max_size]; // occur[state] = 1 if state is prefix of string
vector<int> state_with_len[max_len];
for(int state = 0; state < state_count; state++)
    state_with_len[len[state]].push_back(state);
// If you traverse states from bigger length to smallest you will have
// Reversed topological order. The same which you use in dfs actually.
for(int length = max_len - 1; length > 0; length--)
    for(auto state: state_with_len[length])
        occur[link[state]] += occur[state];

Comments (17)

Safrout

9 years ago

Am I asking something so stupid that people don't bother to answer, or something so difficult that no one can answer it?!

I really want to know if I do have problems.

adamant

9 years ago

Well, I suppose that suffix structures are just not very interesting to most people here :)

Well, OK. I can help you, but you need to come up with some specific questions.

By the way, what is the "right array"? Can you give me some Russian comment where it is mentioned?

Well, what I need is the following:

1- A good explanation of suffix links (with some examples that clarify what an end position class is). I know that the suffix link points to the node containing the longest suffix of my node's strings that is not in my end position class, but why is that, and what is it used for?

2- I saw these loops in the solutions of many people who use a suffix automaton, and I couldn't understand what they do (they look useful).

for (int i = 0,p = S; i < n; ++i)
	++r[p = ch[p][s[i] - 'a']];
static int b[N],t[N];
for (int i = 1; i <= cnt; ++i) ++b[l[i]];
for (int i = 1; i <= n; ++i) b[i] += b[i - 1];
for (int i = 1; i <= cnt; ++i) t[b[l[i]]--] = i;
for (int i = cnt; i; --i) r[fa[t[i]]] += r[t[i]];

3- An explanation of how to find the longest common substring of a set of strings (if the solution is easy once the last 2 points are understood, then don't answer this point).

Thank you very much

Omg, I don't know for sure, but the second piece of code looks like the counting sort from suffix array construction. Can you point me to some comments or submissions that contain it, please?

They appear in these 2 solutions:

1- A solution for SPOJ NSUBSTR problem :

https://gist.github.com/lazycal/11244808

2- A solution for SPOJ LCS2 problem :

https://gist.github.com/foreseeable/6266824

OK, I think I understood what this code does. It is a very weird way to do it, but anyway.

==1==

for (int i = 0,p = S; i < n; ++i)
	++r[p = ch[p][s[i] - 'a']];

You want to count, for every state of the suffix automaton, how many times its strings occur in the string. That is the array r. You initialize it with 1 in every state that corresponds to some prefix of the string.

==2==

static int b[N],t[N];
for (int i = 1; i <= cnt; ++i) ++b[l[i]];
for (int i = 1; i <= n; ++i) b[i] += b[i - 1];
for (int i = 1; i <= cnt; ++i) t[b[l[i]]--] = i;

In this part you order the states by length (not lexicographically): it is a counting sort on the lengths l[i]. At the end, array t contains a topological order of the states, since a state's suffix link always points to a state with a strictly smaller length.
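The counting sort in step 2 can be sketched on its own as follows. This is a minimal 0-indexed version, not the code from the gists; the function name `order_by_len` and the vector-based interface are assumptions made for illustration.

```cpp
#include <vector>

// Hypothetical helper: returns state indices ordered by non-decreasing len[],
// using a counting sort. len[v] is assumed to be the length of the longest
// string of state v, as in the l[] array of the snippet above.
std::vector<int> order_by_len(const std::vector<int>& len) {
    int max_len = 0;
    for (int l : len)
        if (l > max_len) max_len = l;

    // bucket[k] = number of states with length k, then turned into prefix sums
    std::vector<int> bucket(max_len + 1, 0);
    for (int l : len) ++bucket[l];
    for (int k = 1; k <= max_len; ++k) bucket[k] += bucket[k - 1];

    // Walk the states backwards so that equal lengths keep their index order.
    std::vector<int> order(len.size());
    for (int v = (int)len.size() - 1; v >= 0; --v)
        order[--bucket[len[v]]] = v;
    return order;
}
```

For lengths {0, 3, 1, 2, 1} this yields the order {0, 2, 4, 3, 1}. The snippet above fills t without preserving the order of equal lengths, which is fine here because only the length order matters.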

==3==

for (int i = cnt; i; --i) r[fa[t[i]]] += r[t[i]];

Finally, you finish calculating array r by pushing the initial values along the suffix links of the automaton in reverse topological order. This procedure is described on e-maxx, I suppose.
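Putting the three steps together, here is a self-contained sketch of a suffix automaton that computes the occurrence counts. It follows the standard online construction rather than the code from the gists: the struct and member names (`SuffixAutomaton`, `cnt`, `occurrences`) are assumptions, the alphabet is assumed to be 'a'..'z', and the states are ordered with `std::sort` for brevity instead of the counting sort.

```cpp
#include <algorithm>
#include <array>
#include <numeric>
#include <string>
#include <vector>

struct SuffixAutomaton {
    struct State {
        int len = 0, link = -1;
        long long cnt = 0;            // occurrence count (the r[] array above)
        std::array<int, 26> next;
        State() { next.fill(-1); }
    };
    std::vector<State> st;
    int last = 0;

    explicit SuffixAutomaton(const std::string& s) {
        st.emplace_back();            // root state
        for (char ch : s) extend(ch - 'a');
        count_occurrences();
    }

    void extend(int c) {
        int cur = (int)st.size();
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        st[cur].cnt = 1;              // step 1: this state is a prefix of s
        int p = last;
        while (p != -1 && st[p].next[c] == -1) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                State copy = st[q];   // clone q; clones start with cnt = 0
                copy.len = st[p].len + 1;
                copy.cnt = 0;
                int clone = (int)st.size();
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }

    void count_occurrences() {
        // step 2: order states by decreasing len (reverse topological order)
        std::vector<int> order(st.size());
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return st[a].len > st[b].len; });
        // step 3: push counts along the suffix links
        for (int v : order)
            if (st[v].link != -1) st[st[v].link].cnt += st[v].cnt;
    }

    // Number of occurrences of a non-empty t in s (0 if t is not a substring).
    long long occurrences(const std::string& t) const {
        int v = 0;
        for (char ch : t) {
            v = st[v].next[ch - 'a'];
            if (v == -1) return 0;
        }
        return st[v].cnt;
    }
};
```

For example, `SuffixAutomaton sa("abcbc")` has 8 states, and `sa.occurrences("bc")` is 2 while `sa.occurrences("abc")` is 1.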

==4==

Here is equivalent simplified code for the same thing.

P.S. I suppose this is Chinese code? Such crazy one :)

Yes, it is indeed Chinese code. Thank you very much for your time.

I am really happy to talk to you again. BTW, you were the first person to point out to me that the suffix automaton exists at all. So thank you for that too.

Yeah, I remember. You're welcome :)

I wrote about the endpos concept here. Please write if more clarification is needed.
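As a small illustration of endpos classes, the brute-force sketch below groups all substrings of a string by the set of positions where their occurrences end. It is a quadratic toy for tiny inputs only, and the function names are made up for this example; each resulting group corresponds to one non-root state of the suffix automaton.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: for every distinct substring t of s, collect the
// 0-based indices at which an occurrence of t ends.
std::map<std::string, std::set<int>> endpos_sets(const std::string& s) {
    std::map<std::string, std::set<int>> endpos;
    for (int i = 0; i < (int)s.size(); ++i)
        for (int j = i; j < (int)s.size(); ++j)
            endpos[s.substr(i, j - i + 1)].insert(j);
    return endpos;
}

// Hypothetical helper: group substrings that share the same endpos set.
// Within a group the strings are listed in lexicographic order.
std::map<std::set<int>, std::vector<std::string>>
endpos_classes(const std::string& s) {
    std::map<std::set<int>, std::vector<std::string>> classes;
    for (const auto& kv : endpos_sets(s))
        classes[kv.second].push_back(kv.first);
    return classes;
}
```

For s = "abcbc" this gives 7 classes. The class with endpos {4} is {abcbc, bcbc, cbc}: its strings are suffixes of one another with consecutive lengths. The next shorter suffix, "bc", also ends at position 2, so it falls into a different class, {bc, c} with endpos {2, 4}, and that is where the suffix link of the first class points.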

Now I understand the definition clearly. Thank you for that. I will spend some time understanding why it is useful. If you have any nice references on why it is useful, that would be nice. Thank you again.

Well, I know how to use them, but it is hard for me to clearly explain the things it is based on. It is kind of intuitive for me. And I still have references except e-maxx. Maybe I'll write an entry about the suffix automaton, but very next time...

The main reason it is so useful is that the suffix links of the automaton of string s form the suffix tree of the reversed s. This makes the suffix automaton an incredibly powerful structure.

It somehow became clearer after you pointed out the suffix tree connection.

I will be waiting for your blog entry.

Thank you very much.

I will bother you with one more question: where did you get that experience and intuition? What did you go through to get it?

Also please include the other references if you can.

The suffix automaton was first explained to me by I_love_natalia. After that I solved a lot of problems, sometimes using some information from e-maxx and/or my friends... That's all, I suppose.

Okay, so can you give me the problems that made a difference to your understanding (if you remember them for sure)?

This and this were useful. But actually I solved a very large number of problems...

elita15

adamant, hi. I understood the suffix automaton algorithm given on e-maxx, but I am not able to understand how to find the longest common substring of multiple strings (the translation of that part isn't very clear :( ). Could you please give me a brief idea of how a suffix automaton is used to solve this problem (SPOJ LCS2)? Thanks in advance.

Safrout's blog