Hi everyone!
Today I want to describe an efficient solution of the following problem:
Composition of Formal Power Series. Given $$$A(x)$$$ and $$$B(x)$$$ of degree $$$n-1$$$ such that $$$B(0) = 0$$$, compute $$$A(B(x)) \pmod{x^n}$$$.
The condition $$$B(0)=0$$$ doesn't decrease the generality of the problem, as $$$A(B(x)) = P(Q(x))$$$, where $$$P(x) = A(x+B(0))$$$ and $$$Q(x) = B(x) - B(0)$$$. Hence you could replace $$$A(x)$$$ and $$$B(x)$$$ with $$$P(x)$$$ and $$$Q(x)$$$ when the condition is not satisfied.
Solutions that I'm going to describe were published in Joris van der Hoeven's article about operations on formal power series. The article also describes a lot of other common algorithms on polynomials. It is worth noting that Joris van der Hoeven and David Harvey are the inventors of breakthrough $$$O(n \log n)$$$ integer multiplication algorithm in the multitape Turing machine model.
Naive methods
An obvious solution would be to evaluate $$$B(x)$$$ in $$$n^2$$$ points, then evaluate $$$A(B(x_i))$$$ in these points and interpolate $$$(A \circ B)(x)$$$. This solution would require $$$O(n^2 \log^2 n)$$$ operations with an enormous constant. Not the best idea for $$$n \leq 8000$$$.
Another possible approach is to go from the definition of the polynomial composition:
You could compute $$$B^2(x), B^3(x), \dots, B^{n-1}(x)$$$ modulo $$$x^n$$$ one by one and sum them up, which would take $$$O(n^2 \log n)$$$ operations.
Divide and conquer
If we write $$$A(x) = A_0(x) + x^t A_1(x)$$$, then
With this decomposition, you can reduce the computation of $$$A(B(x))$$$ to $$$2$$$ recursive calls to $$$A_0(B(x))$$$ and $$$A_1(B(x))$$$ with $$$O(n \log n)$$$ overhead for the multiplication. If you use $$$t = \lfloor \frac{n}{2} \rfloor$$$, the degree of $$$A_0$$$ will be at most $$$\lfloor \frac{n}{2} \rfloor$$$ and the degree of $$$A_1$$$ will be $$$\lceil \frac{n}{2} \rceil$$$.
Considering possible values of $$$t$$$ over all recursion levels they will be at most $$$O(\log n)$$$ different values, hence you can compute necessary powers of $$$B(x)$$$ in advance, which would require $$$O(n \log^2 n)$$$ cumulative time for the pre-processing.
The overall time complexity for