Suffix Array — CP Algorithms

Правка en1, от saelcc03, 2024-12-12 12:03:41

A friendly explanation of the implementation of Sufix Array CP-alg. Constructive comments are welcome ^-^

Recap

The algorithm will sort the cycle shifts of a string. For this purpose, we'll maintain arrays p and c on each iteration, where p[i] is the index of the i-th substring and c[i] is the class of the i-th substring, meaning that a substring with a lower class will be lexicographically smaller.

The algorithm will perform $$$log(n) + 1$$$ iterations $$$n$$$ times => $$$O(nlogn)$$$

0-th iteration

Probably this is not so complicated but just in case. The main idea of counting sort here is array cnt. Once you have the frequencies of every character in the string, do a prefix sum with cnt. Then, $$$p[--cnt[s[i]]] = i$$$ will just mean that the characters with higher frequency will be placed later p.

    vector<int> p(n), c(n), cnt(max(alphabet, n), 0);
    for (int i = 0; i < n; i++)
        cnt[s[i]]++;
    for (int i = 1; i < alphabet; i++)
        cnt[i] += cnt[i-1];
    for (int i = 0; i < n; i++)
        p[--cnt[s[i]]] = i;
    c[p[0]] = 0;
    int classes = 1;
    for (int i = 1; i < n; i++) {
        if (s[p[i]] != s[p[i-1]])
            classes++;
        c[p[i]] = classes - 1;
    }

Next iterations

We have done the 0-th iteration to get $$$p$$$ and $$$c$$$ arrays for substrings of lenght $$$1 \ll 0 = 1$$$. Now we'll make the following steps h times so $$$(1 \ll h) < n$$$, being $$$(1 \ll h)$$$ the length of the substring in each step.

The idea is that we need a way to compare substrings of length $$$2^k$$$ for each k. We already have a permutation array $$$p$$$ and a class array $$$c$$$ for substrings of length 1 (what we did in the 0-th iteration). So now we want to do get the arrays $$$p$$$ and $$$c$$$ for substrings of length $$$2^1$$$. We can see that a substring of length $$$2^1$$$ consists of two substrings of length $$$2^0$$$. A string of length $$$2^2$$$ consists of two strings of length $$$2^1$$$, and so on. If we are in the $$$(h-1)-th$$$ step, length $$$2^(h-1)$$$ in the i-th index, the class of the string of length $$$2^h$$$ starting at i will be determined by $$$c[i]$$$ and $$$c[i+2^(h-1)]$$$, but what about the new position (the value of the new p[i])? As described in the blog, it can be used a simple sort or couting sort over pairs, but it's slow. The faster approach is described below.

    vector<int> pn(n), cn(n);
    for (int h = 0; (1 << h) < n; ++h) {
        for (int i = 0; i < n; i++) {
            pn[i] = p[i] - (1 << h);
            if (pn[i] < 0)
                pn[i] += n;
        }
        fill(cnt.begin(), cnt.begin() + classes, 0);
        for (int i = 0; i < n; i++)
            cnt[c[pn[i]]]++;
        for (int i = 1; i < classes; i++)
            cnt[i] += cnt[i-1];
        for (int i = n-1; i >= 0; i--)
            p[--cnt[c[pn[i]]]] = pn[i];
        cn[p[0]] = 0;
        classes = 1;
        for (int i = 1; i < n; i++) {
            pair<int, int> cur = {c[p[i]], c[(p[i] + (1 << h)) % n]};
            pair<int, int> prev = {c[p[i-1]], c[(p[i-1] + (1 << h)) % n]};
            if (cur != prev)
                ++classes;
            cn[p[i]] = classes - 1;
        }
        c.swap(cn);
    }
    return p;
}

Probably the most tricky line is $$$p[--cnt[c[pn[i]]]] = pn[i]$$$. We'll cover that soon.

First we create auxiliar arrays $$$pn$$$ and $$$cn$$$. $$$pn[i]$$$ is just the position of the i-th substring in the sorted order order, minus the current length of the substring: $$$pn[i] = p[i] - (1 « h)$$$. In that way, we are sorting the pairs by the second element by only subtracting.

The array of frequencies $$$cnt$$$ will be computed the same ways as the previous step with the current class array $$$c$$$. Until here, the idea of counting sort is the same as the first block of code.

Now we've got to the line $$$p[--cnt[c[pn[i]]]] = pn[i]$$$. Let's explain it from outside to inside. It's very important to notice we are iterating from i = n-1 to 0. We are assigning new values to p so now p[i] is the index where the i-th substring of length $$$2^k$$$ starts, considering the order of the two substrings of length $$$2^(k-1)$$$ consists of. This index will be $$$pn[i]$$$. Why? Here it's important to look at two things happening. First, $$$cnt[c[pn[i]]]$$$ is the frequency of the class of $$$pn[i]$$$ (the left substring of the pair). Second, we are assigning that position in p to $$$pn[i]$$$ because it's the index where this new substring of length $$$2^h$$$ will start. Here a little example:

Imagine our string is s = "aababc" and we have completed the step for substring of length 2, now we want to sort strings of length 4.

$$$\newline$$$
$$$ s = a, a, b, a, b, c $$$
$$$ p = 0, 1, 3, 2, 4, 5 $$$
$$$ c = 0, 1, 2, 1, 3, 4 $$$

then the $$$pn$$$ will be:

$$$pn = 4, 5, 1, 0, 2, 3 $$$

and $$$cnt$$$:

$$$ cnt = 3, 5, 6 $$$

As we said, we'll iterate from i = n-1 to 0, because by doing that, the substring with higher frequencies will be assigned to lexicographically higher substrings. Let's look at the first assignment:

$$$\newline$$$
$$$ p[--cnt[c[3]]] = 3 $$$
$$$ p[2] = 3$$$

At position 3, starts "ab" substring of length 2, behind the highest substring of length 2 "ca". The frequency, (that is, the position in the sorted order) is given by "ab" (we used the array $$$c$$$ to get the frequency) and the index is given by the position of "ca" minus $$$2^(h-1) = 2$$$ (that is, position of "ab"). That's why it's also important to iterate from $$$i=n-1$$$ to $$$0$$$, because by doing that we are making sure that frequenciy/position of "ab" is assigned first to the highest lexicographically substring of length 2 "ca".

The final part is computing the new class array $$$cn$$$, very similar to the way it was computed in the 0-iteration.

Tbh, it took me a while to fully understand the algorithm, and explaining it was even more complicated. If there are any other points to mention, I'll be reading your feedback :>

Теги cp-algorithms, suffix array, explanation

История

 
 
 
 
Правки
 
 
  Rev. Язык Кто Когда Δ Комментарий
en4 Английский saelcc03 2024-12-14 04:14:13 688 Tiny change: '[c[pn[i]$ //index of ab$]]] = pn[' -> '[c[pn[i]$ //index of ab $]]] = pn['
en3 Английский saelcc03 2024-12-12 19:11:39 4
en2 Английский saelcc03 2024-12-12 19:10:15 18
en1 Английский saelcc03 2024-12-12 12:03:41 5977 Initial revision (published)