Say we need to solve a recurrence of the form:
$$ a_n = f \left (n, \displaystyle \sum_{i=0}^{n-1} a_i b_{n-i} \right ) \mod m $$↵
↵
Here, $f$ is an arbitrary function which can be computed in $O(1)$. An example of such a problem is [problem:553E]. The tutorial gives an $O(N \log^2 N)$ solution. 
↵
I recently saw a similar problem, [SERSUM](https://www.codechef.com/MAY18A/problems/SERSUM), in the CodeChef May Long Challenge, where I used a technique I had thought of when I first saw the earlier problem. In this post I explain this very simple and intuitive sqrt-decomposition solution, running in $O(N \sqrt{N \log{N}})$, which takes almost the same time as the $O(N \log^2 N)$ solution because of a very low constant factor. Here goes:
↵
Let us divide the indices into blocks of size, say, $K$, giving us a total of $\lceil N / K \rceil$ blocks. Block $t$ consists of the indices $[(t-1)K, tK - 1]$.
↵
Let $B_t = \displaystyle \sum_{i=0}^{(t+1)K - 1} b_i x^i$ (note that $B_t$ must extend up to index $(t+1)K - 1$, since the coefficients $b_{n-i}$ for $n$ up to $(t+1)K - 1$ are needed below).
↵
Let $A_t = \displaystyle \sum_{i=0}^{K t -1} a_i x^i $.↵
↵
Let $C_t = A_{t} B_{t}$.↵
↵
When processing block $t + 1$ we assume that we know $C_t$, which is computed as soon as block $t$ has been completely processed. So we compute such a $C_t$ a total of $\frac{N}{K}$ times using FFT, in $O(\frac{N^2}{K} \log N)$ total time.
↵
Consider some $n$ in $[t K, (t + 1) K - 1]$. Then, 
$ c_n = \displaystyle \sum_{i=0}^{n-1} a_i b_{n-i} $↵
↵
$ = \displaystyle \sum_{i=0}^{K t - 1} a_i b_{n-i} + \displaystyle \sum_{i=Kt}^{n-1} a_i b_{n-i}$↵
↵
$ = [x^n] C_t + \displaystyle \sum_{i=Kt}^{n-1} a_i b_{n-i}$.↵
↵
Clearly, the right part can be calculated in $O(K)$, which adds an extra $O(N K)$ over all $n$.
↵
Overall complexity $= O(\frac{N^2 \log N}{K} + N K)$, which is minimized by taking $K = \sqrt{N \log N}$, giving an overall complexity of $O(N \sqrt{N \log N})$.
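The whole scheme can be sketched as below. This is a minimal illustration, not a contest implementation: `convolve` is a naive $O(n^2)$ stand-in for the FFT multiplication (which is what gives the $O(\frac{N^2}{K} \log N)$ term in practice), `f` is a hypothetical example function, and products are assumed to fit in 64 bits. For simplicity the full polynomial $b$ is used in place of the prefix $B_t$; the asymptotics are the same.

```cpp
#include <cassert>
#include <vector>
using namespace std;
using i64 = long long;

// Naive polynomial multiplication mod m -- a stand-in for FFT.
// Assumes each product p[i]*q[j] fits in a signed 64-bit integer.
vector<i64> convolve(const vector<i64>& p, const vector<i64>& q, i64 m) {
    vector<i64> r(p.size() + q.size() - 1, 0);
    for (size_t i = 0; i < p.size(); i++)
        for (size_t j = 0; j < q.size(); j++)
            r[i + j] = (r[i + j] + p[i] * q[j]) % m;
    return r;
}

// Computes a[n] = f(n, sum_{i<n} a[i]*b[n-i] mod m) mod m for n = 0..N-1,
// with block size K.
template <class F>
vector<i64> solve(const vector<i64>& b, i64 m, int K, F f) {
    int N = (int)b.size();
    vector<i64> a = { f(0, (i64)0) % m };   // the sum is empty for n = 0
    vector<i64> C;                          // coefficients of C_t = A_t * B_t
    for (int n = 1; n < N; n++) {
        // Block boundary: a holds exactly tK coefficients here, so one
        // multiplication yields C_t for the whole upcoming block.
        if (n % K == 0) C = convolve(a, b, m);
        i64 s = (n < (int)C.size()) ? C[n] : 0;
        for (int i = (n / K) * K; i < n; i++)   // brute-force tail, <= K terms
            s = (s + a[i] * b[n - i]) % m;
        a.push_back(f(n, s) % m);
    }
    return a;
}
```

During the first block $C$ is still empty, so the brute-force tail alone covers $i \in [0, n-1]$, which matches the formula with $t = 0$.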
↵
An optimization:↵
↵
Note that the right part can be computed with only one mod operation: set $m_2 = m^2$, accumulate the products in a 64-bit integer, and subtract $m_2$ whenever the running sum reaches it; at the end, reduce the total sum modulo $m$. This is a very common technique, used in matrix multiplication as well.
↵
This means that we should use a block size even greater than $\sqrt{N \log N}$ for better results, because of the huge difference in constant factors between the two parts (computing $C_t$ using FFT, and computing the remaining sum in $O(K)$ by brute-force iteration with the optimization above). Just as an example, for $N = 10^5$ and $m = 10^9 + 7$ (which is a bad modulus for FFT problems), the best performance comes when you take $K \approx 6 \times 10^3$. For this input size it runs in about 2 seconds, which is pretty good.
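The brute-force tail with this optimization might look like the sketch below (`tail_sum` is an illustrative name, not from the original solution). It assumes $m < 2^{31}$ and inputs already reduced mod $m$, so each product and the running sum stay below $2^{63}$ and unsigned 64-bit arithmetic never wraps.

```cpp
#include <cassert>
using u64 = unsigned long long;

// Sum of a[i]*b[n-i] mod m for i in [start, n), using the m^2 trick:
// one conditional subtraction per term instead of one % per term,
// and a single % at the very end.
u64 tail_sum(const u64* a, const u64* b, int start, int n, u64 m) {
    u64 m2 = m * m, s = 0;          // s stays < m2 < 2^62 between iterations
    for (int i = start; i < n; i++) {
        s += a[i] * b[n - i];       // each product < m^2, so s < 2*m2 < 2^63
        if (s >= m2) s -= m2;       // cheap subtraction replaces a modulo
    }
    return s % m;                   // the only mod operation
}
```

A conditional subtraction is far cheaper than an integer division, which is where the large constant-factor gap between the two parts comes from.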