Funny construction for Suffix Tree/Fast set split (Leader Split Trick)

Outline

Problem
Algorithm
Complexity Analysis
Code
Example Problems

Hello, in this blog I'll share a funny way to construct a suffix tree in $$$O(n \log^2{n})$$$ time, for a given string $$$S$$$ of length $$$n$$$. I am going to call the underlying idea the "Leader Split Trick". It can probably be used to solve other problems too.

Problem

A suffix tree of a string $$$S$$$ is a radix tree constructed from all the suffixes of $$$S$$$. It's easy to see that it has $$$O(n)$$$ nodes. It can be constructed in $$$O(n)$$$ time with known algorithms such as Ukkonen's.

I am going to share a simple and practically useless way of building it in a worse time complexity, $$$O(n\log^2{n})$$$.

Algorithm

Notation

Initially, we start with an empty tree (with a virtual root node) and a set $$$G$$$ of all suffixes from $$$1$$$ to $$$n$$$. Each suffix is stored as its starting index.

Let $$$l_u$$$ denote the leaf of suffix $$$u$$$ in the final suffix tree. It's easy to see that the paths from the root node to $$$l_u$$$ for all $$$u \in G$$$ share a common prefix up to some internal node $$$s_G$$$, after which they split apart along different downward edges of $$$s_G$$$. Let's define $$$d_G$$$ to be the longest common prefix across the paths $$$(\text{root}, l_u)$$$ for all $$$u \in G$$$ (its length is the depth of $$$s_G$$$).

Our algorithm will essentially do the following:

Find $$$d_G$$$.
Split apart $$$G$$$ into disjoint subsets $$$G'$$$ (each subset $$$G'$$$ will have suffixes whose leaves lie in the subtree of a unique child node of $$$s_G$$$).
Solve the problem recursively for each subset, and add an edge in the suffix tree from $$$s_G$$$ to $$$s_{G'}$$$ for every $$$G'$$$.

Now, we define a recursive function $$$f(G, L, \text{dep}, \text{dis})$$$.

Definitions

In each call, $$$f(G, L, \text{dep}, \text{dis})$$$, we do the following:

If the "Leader" element $$$L$$$ is undefined, set $$$L$$$ to a uniformly random element of $$$G$$$. Then, for every suffix $$$i \in G$$$, compute $$$\text{dis}[i]$$$, the length of the longest common prefix of the suffixes $$$i$$$ and $$$L$$$, measured past the first $$$\text{dep}$$$ characters (which all suffixes in $$$G$$$ already share). This can be done in $$$O(\vert G \vert \cdot \log{n})$$$ using binary search + hashing. We keep $$$\text{dis}$$$ sorted by value, so that the minimum and all elements attaining it can be found quickly.
Let $$$m$$$ be the minimum value in $$$\text{dis}$$$. It's easy to see that the internal node created from splitting $$$G$$$ will occur at depth $$$\text{dep} + m$$$. We create $$$s_G$$$, and add an edge corresponding to the substring $$$S[L + \text{dep}, L + \text{dep} + m - 1]$$$ (with 1-indexed positions) from $$$s_{G_p}$$$, the node created by the parent call (or the root), to $$$s_G$$$.
Now, we delete from $$$G$$$ all suffixes $$$i \in G$$$ with $$$\text{dis}[i] = m$$$ (and their corresponding elements from $$$\text{dis}$$$), and group them into disjoint subsets based on the character $$$S_{i + \text{dep} + m}$$$ for suffix $$$i$$$ (basically the next character after the internal node).
We call $$$f(G', \text{null}, \text{dep} + m, \text{null})$$$ for every newly created subset $$$G'$$$, and also call $$$f(G, L, \text{dep} + m, \text{dis})$$$ for the modified subset $$$G$$$. The values left in $$$\text{dis}$$$ all exceed $$$m$$$ and are from now on measured past depth $$$\text{dep} + m$$$ (storing absolute LCP lengths and comparing them against the current depth avoids touching them).
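
The "binary search + hashing" step can be sketched as follows. This is a minimal Python sketch, not the author's implementation: the names `build_hash`, `substring_hash` and `lcp`, the 0-indexed positions, and the fixed modulus and base are all assumptions made for illustration. It returns the full (absolute) LCP length of two suffixes.

```python
MOD = (1 << 61) - 1   # assumed modulus (a Mersenne prime)
BASE = 911382323      # assumed fixed base; a random base is safer against crafted inputs

def build_hash(s):
    """Return prefix hashes h and base powers p for the string s."""
    n = len(s)
    h = [0] * (n + 1)
    p = [1] * (n + 1)
    for i, ch in enumerate(s):
        h[i + 1] = (h[i] * BASE + ord(ch) + 1) % MOD
        p[i + 1] = (p[i] * BASE) % MOD
    return h, p

def substring_hash(h, p, l, r):
    """Hash of s[l:r] (0-indexed, half-open) in O(1)."""
    return (h[r] - h[l] * p[r - l]) % MOD

def lcp(h, p, n, i, j):
    """Length of the longest common prefix of s[i:] and s[j:], by binary search."""
    lo, hi = 0, n - max(i, j)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if substring_hash(h, p, i, i + mid) == substring_hash(h, p, j, j + mid):
            lo = mid       # the first mid characters match (up to hash collisions)
        else:
            hi = mid - 1
    return lo
```

Each `lcp` call does $$$O(\log{n})$$$ hash comparisons, which is where the $$$O(\vert G \vert \cdot \log{n})$$$ cost of computing $$$\text{dis}$$$ comes from. Equal hashes only imply equal substrings with high probability.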

Note: There might be some off-by-one errors.
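
To pin down the indices, here is a small 0-indexed sketch of the whole procedure in Python. It is not the author's implementation, and it rests on several assumptions: a sentinel character `"\0"` (assumed not to occur in the input) is appended so that every suffix ends in its own leaf; `dis` stores absolute LCP lengths, so the depth of a new node is the minimum itself rather than $$$\text{dep} + m$$$; the recursion is replaced by an explicit stack; and LCPs are computed naively, so this sketch does not achieve the $$$O(n \log^2{n})$$$ bound (a hashed LCP would be needed for that). The names `build_suffix_tree`, `lcp_naive`, `edges` and `leaf_of` are hypothetical.

```python
import random

def lcp_naive(s, i, j):
    """Longest common prefix of s[i:] and s[j:], by direct comparison."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k

def build_suffix_tree(text, rng=random):
    """Return (edges, leaf_of) for the suffix tree of text + sentinel.

    edges holds tuples (parent, child, start, end): the edge label is s[start:end].
    leaf_of maps a suffix start index to its leaf node. Node 0 is the root.
    """
    s = text + "\0"                       # sentinel, assumed absent from text
    n = len(s)
    edges, leaf_of = [], {}
    next_id = 1
    stack = [(list(range(n)), 0, 0)]      # (group, parent node, depth of parent)
    while stack:
        group, parent, dep = stack.pop()
        leader = rng.choice(group)        # uniformly random leader
        # Absolute LCP with the leader, sorted; the leader itself sorts last.
        dis = sorted((lcp_naive(s, i, leader), i) for i in group)
        k = 0
        while len(dis) - k > 1:           # something besides the leader remains
            m = dis[k][0]
            if m > dep:                   # new internal node at depth m
                edges.append((parent, next_id, leader + dep, leader + m))
                parent, dep = next_id, m
                next_id += 1
            buckets = {}                  # split off suffixes by their next character
            while k < len(dis) and dis[k][0] == m:
                i = dis[k][1]
                buckets.setdefault(s[i + m], []).append(i)
                k += 1
            for sub in buckets.values():
                stack.append((sub, parent, dep))
        # Only the leader is left: hang its leaf below the current node.
        edges.append((parent, next_id, leader + dep, n))
        leaf_of[leader] = next_id
        next_id += 1
    return edges, leaf_of
```

Calling `build_suffix_tree("banana")` gives an edge list in which each tuple `(parent, child, start, end)` labels the edge with `("banana" + "\0")[start:end]`. The inner `while` loop plays the role of the repeated call $$$f(G, L, \text{dep} + m, \text{dis})$$$ on the leader's set, and every pushed stack entry corresponds to a call $$$f(G', \text{null}, \text{dep} + m, \text{null})$$$.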

Complexity Analysis

Consider the following problem:

We have $$$n$$$ singleton sets, and are given some set merge operations. When merging sets $$$A$$$ and $$$B$$$, we merge $$$B$$$ to $$$A$$$ with probability $$$\frac{\vert A \vert}{\vert A \vert + \vert B \vert}$$$ and $$$A$$$ to $$$B$$$ with the probability $$$\frac{\vert B \vert}{\vert A \vert + \vert B \vert}$$$.

The above problem is computationally equivalent to Randomized Quicksort, which has an expected time complexity of $$$O(n \log{n})$$$.
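
One way to see the bound directly, assuming the cost of a merge is the number of elements that get moved: merging $$$A$$$ and $$$B$$$ has expected cost

$$$\frac{\vert A \vert}{\vert A \vert + \vert B \vert} \cdot \vert B \vert + \frac{\vert B \vert}{\vert A \vert + \vert B \vert} \cdot \vert A \vert = \frac{2 \vert A \vert \vert B \vert}{\vert A \vert + \vert B \vert} \le 2 \min(\vert A \vert, \vert B \vert),$$$

which is at most twice the cost of the usual small-to-large merge. Since small-to-large merging moves $$$O(n \log{n})$$$ elements in total, the same expected bound follows here by linearity of expectation.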

It's not difficult to see that our split operations are simply the operations that occur in the above problem, in reverse. (Formally, we can define a bijection between the two sets of operations such that related sets of operations occur with the same probability.) Therefore, the expected time taken by all the split operations is $$$O(n \log{n})$$$.

However, every time we perform a split operation (merge in reverse), we also compute $$$\text{dis}$$$ for the child set $$$C$$$ (which gets merged into the parent set), and that takes $$$O(\vert C \vert \log{n})$$$ time. Thus, our entire algorithm has an expected time complexity of $$$O(n \log^2{n})$$$.

Code

My implementation can be found here.

Some thoughts

This trick seems to have some "online capability", as we can efficiently split a group of nodes into multiple groups (given that the information for query for a group can be processed mostly through a randomly chosen leader element). For example, consider the following problem:

Problem 1

You are given a tree on $$$n$$$ nodes. You also have a set containing all nodes, $$$\{1, 2, \dots, n\}$$$. You have to process the following queries online:

"$$$1\; m\; x\; v_1\; v_2\; \dots \; v_x$$$" : Remove the nodes $$$v_1, v_2, \dots, v_x$$$ from the set $$$S$$$ whose maximum element is $$$m$$$, and create a new set with these elements (it is guaranteed that there exists some set with maximum element $$$m$$$ and $$$v_i \in S \; \forall \; i$$$).

"$$$2 \; m$$$" : Let the set whose maximum element is $$$m$$$ be $$$S$$$. Find some node $$$x \in S$$$ such that $$$\max_{y \in S}{\text{dis}(x, y)} = \max_{u, v \in S}{\text{dis}(u,v)}$$$, where $$$\text{dis}(u, v)$$$ is the distance between $$$u$$$ and $$$v$$$ in the tree.

Solution