juruo_lzy's blog

By juruo_lzy, history, 12 months ago, In English
  • Given $$$n$$$ strings $$$s_1\sim s_n$$$, find how many pairs $$$(i,j)$$$, satisfy:
  • $$$1\le i,j\le n$$$.
  • $$$s_j$$$ is a true substring of $$$s_i$$$.
  • There exists no $$$k$$$ ($$$i,j,k$$$ two different) such that $$$s_j$$$ is a true substring of $$$s_k$$$ and $$$s_k$$$ is a true substring of $$$s_i$$$.
  • $$$n,\sum\limits_{i=1}^n|s_i|\le 10^6$$$. If $$$i\ne j$$$, then $$$s_i\ne s_j$$$.
  • $$$\text{2 s / 512 MB}$$$.

Suffix-order all strings by first collapsing them into one large string $$$S$$$. Consider enumerating $$$i$$$ to find out which $$$j$$$ will contribute.

Consider for each suffix $$$s_i$$$ of $$$s_i[x,|s_i|]$$$, find a longest string $$$s_y$$$ in $$$s_1\sim s_n$$$ that satisfies that it is a true prefix of $$$s_i[x,|s_i|]$$$, denoted as $$$l_{i,x}=|s_y|$$$. If no such $$$s_y$$$ can be found, $$$l_{i,x}$$$ is said to be meaningless. If some $$$l_{i,x}$$$ can appear in an equation or inequality for an operation, $$$l_{i,x}$$$ is meaningful if and only if $$$l_{i,x}$$$ is meaningful.

Construct a binary irreducible set $$$T_i$$$ where a binary $$$(u,v)$$$ appears in $$$T_i$$$ if and only if the following four conditions are satisfied:

  • $$$1\le u,v\le |s_i|$$$.
  • $$$l_{i,u}$$$ makes sense.
  • $$$v=u+l_{i,u}-1$$$.
  • There exists no $$$w\in[1,u)$$$ such that $$$u+l_{i,u}-1\le w+l_{i,w}-1$$$.

Call $$$s_j$$$ occurs in $$$T_i$$$ if and only if there exists $$$(u,v)\in T_i$$$ such that $$$s_i[u,v]=s_j$$$.

Construct another binary irreducible set $$$R_{i,j}$$$ where a binary $$$(u,v)$$$ occurs in $$$R_{i,j}$$$ if and only if the following two conditions are satisfied:

  • $$$1\le u\le v\le |s_i|$$$.
  • $$$s_i[u,v]=s_j$$$.

Then the binary $$$(i,j)$$$ in the original title satisfies the condition, if and only if $$$\boldsymbol{R_{i,j}\ne \varnothing}$$$ and $$$\boldsymbol{R_{i,j}\subseteq T_i}$$$.

$$$R_{i,j}\ne \varnothing$$$ Obviously, there would be no proof.

Proof:

  • Sufficiency:

Consider the converse, assuming that $$$k$$$ exists when $$$R_{i,j} \subseteq T_i$$$ such that $$$s_j$$$ is a substring of $$$s_k$$$ and $$$s_k$$$ is a true substring of $$$s_i$$$.

Let $$$s_i[a,b]=s_k$$$ and $$$s_k[c,d]=s_j$$$. Then $$$s_i[a+c-1,a+d-1]=s_j$$$. From the known conditions we get $$$(a+c-1,a+d-1)\in R_{i,j},T_i$$$.

If $$$l_{i,a+c-1}\ne d-c+1$$$, it does not match the third condition of $$$(a+c-1,a+d-1)\in T_i$$$.

Otherwise, at this point $$$c>1$$$. By the definition of $$$l_{i,a}$$$ it follows that $$$l_{i,a}\ge b-a+1$$$, i.e. $$$a+l_{i,a}-1\ge b$$$. But $$$a+c-1+l_{i,a+c-1}-1=a+c-1+d-c+1-1=a+d-1$$$. From $$$d\le b-a+1$$$ we can get $$$a+d-1\le b$$$. Inconsistent with the fourth condition of $$$(a+c-1,a+d-1)\in T_i$$$.

Therefore the assumption is not valid. When $$$R_{i,j} \subseteq T_i$$$ there must be no $$$k$$$ such that $$$s_j$$$ is a substring of $$$s_k$$$ and $$$s_k$$$ is a true substring of $$$s_i$$$.

  • Necessity:

Consider $$$(u,v)\in R_{i,j}$$$ but $$$(u,v)\notin T_i$$$. At this point $$$(u,v)$$$ must satisfy the first two conditions for some binary to be in $$$T_i$$$.

If $$$l_{i,u}\ne v-u+1$$$, then there is a longer string $$$s_y$$$ that is a prefix of $$$s_i[u,|s_i|]$$$. At this point $$$s_j$$$ is a true substring of $$$s_y$$$.

Otherwise, there must exist $$$w\in[1,u)$$$ such that $$$u+l_{i,u}-1\le w+l_{i,w}-1$$$. Show that there exists a string $$$s_y=s_i[w,w+l_{i,w}-1]$$$. At this point $$$s_y$$$ contains $$$s_j$$$ in the middle, i.e., $$$s_j$$$ is a true substring of $$$s_y$$$ (which is guaranteed to be true because $$$w\in[1,u)$$$).

So not satisfying $$$R_{i,j}\subseteq T_i$$$ must not satisfy the original condition, which is a necessary condition.

This conclusion is not enough, it is always impossible to derive the set and then enumerate the judgments.

Further reasoning yields that it is actually equivalent to $$$\boldsymbol{\sum\limits_{(u,v)\in T_i}P(s_i[u,v]=s_j)=|R_{i,j}|}$$$.

To distinguish between middle and Iverson brackets, $$$P(A)$$$ is used to indicate whether the $$$A$$$ proposition is true. $$$P(A)=1$$$ if and only if $$$A$$$ is true; $$$P(A)=0$$$ if and only if $$$A$$$ is false.

Why? It is not hard to find $$$\sum\limits_{(u,v)\in T_i}P(s_i[u,v]=s_j)=|T_i\bigcap R_{i,j}|\le |R_{i,j}|$$$. And $$$|T_i\bigcap R_{i,j}|=|R_{i,j}|$$$ is equivalent to $$$R_{i,j}\subseteq T_i$$$.

Then we only need to find $$$\sum\limits_{(u,v)\in T_i}P(s_i[u,v]=s_j)$$$ and $$$|R_{i,j}|$$$ for a $$$j$$$.

  • The former is solved:

First get $$$T_i$$$. Consider for each string $$$s_a$$$, which ranked suffix's it will contribute to. This suffix has to contain $$$s_a$$$, which is equivalent to both $$$|\text{LCP}|\ge |s_a|$$$. One can maintain the ST table of the $$$\text{height}$$$ array and then bisect it to get the ranking interval, letting the $$$l$$$ values in this interval take $$$\max$$$ for $$$|s_a|$$$. Line segment tree maintenance is sufficient.

It can then be swept from left to right to maintain the $$$u+l_{i,u}-1\le w+l_{i,w}-1$$$ maximum value of the prefix $$$pre$$$. Query at a single point in the line tree for the current suffix ranking that position worth to $$$l_{i,x}$$$. If $$$x+l_{i,x}-1>pre$$$, add $$$(x,x+l_{i,x}-1)$$$ to $$$T_i$$$.

Then use buckets to maintain the longest prefix of how many suffixes each of the strings in $$$s_1\sim s_n$$$ in $$$T_i$$$ is used as.

  • The latter is solved for:

Consider $$$s_j$$$ appearing as the prefix of some suffix. the same can be done to find the ranked interval of the suffixes that contain it. It then becomes a question of how many rankings in the interval are such that the suffix of that ranking comes from the string numbered $$$i$$$.

For each $$$i$$$, a vector is used to store the suffix occurrences from smallest to largest, bisecting the left and right endpoints $$$l,r$$$, and the answer is $$$r-l+1$$$.

This still requires enumerating $$$j$$$. But notice:

Since $$$R_{i,j}\ne \varnothing$$$, those $$$s_j$$$ that don't appear in $$$T_i$$$ must not contribute. Because at this point $$$|T_i\bigcap R_{i,j}|=0< |R_{i,j}|$$$. Only those $$$s_j$$$ that occur in $$$T_i$$$ need to be considered. And for each $$$u$$$, there will be only one $$$v$$$ such that $$$(u,v)\in T_i$$$, hence $$$|T_i|=\mathcal{O}(|s_i|)$$$. This results in a total of only $$$\mathcal{O}(|S|)$$$ times to be processed.

In order not to count the recalculation omissions, consider that for each string occurring in $$$T_i$$$, it is counted at the position of its first occurrence.

Finally, do this for each string in $$$s_1\sim s_n$$$ and you'll get the correct answer.

The space-time complexity is all $$$\mathcal{O}(|S|\log |S|)$$$.

AC code

  • Vote: I like it
  • +14
  • Vote: I do not like it