TL;DR: the popular empirical bounds for FFT stability overestimate the available precision by a few bits -- multiplication might thus produce a wrong answer even after thorough testing on random inputs.
Introduction
So here's how people typically advise you to use FFT after proving various theorems and providing an $$$\mathcal{O}(n \log n)$$$ algorithm.
Suppose you want to multiply two big integers, stored in binary. Split them into groups of $$$B$$$ bits, called $$$a_0, a_1, \dots$$$ and $$$b_0, b_1, \dots$$$ respectively, so that the integers are equal to $$$a_0 + a_1 x + \dots$$$, $$$b_0 + b_1 x + \dots$$$ for $$$x = 2^B$$$. Multiply the polynomials $$$(a_0 + a_1 x + \dots) (b_0 + b_1 x + \dots)$$$ via FFT and substitute $$$x = 2^B$$$ to obtain the $$$n$$$-bit product.
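To make the scheme concrete, here's a minimal sketch (my illustration, not anyone's library code): a naive double loop stands in for the FFT-based convolution, and the carry loop performs the substitution $$$x = 2^B$$$.

```cpp
// Minimal sketch of the scheme above. The naive double loop stands in for the FFT step,
// and the numbers are assumed non-negative, stored as bit vectors (least significant first).
// Assumes sizes small enough that sums and carries fit into 64 bits.
#include <cstdint>
#include <vector>

const int B = 16; // bits per limb; the whole article is about how large this can be

// Split a bit string into B-bit limbs a_0, a_1, ...
std::vector<uint64_t> split(const std::vector<bool>& bits) {
    std::vector<uint64_t> limbs((bits.size() + B - 1) / B);
    for (std::size_t i = 0; i < bits.size(); i++)
        if (bits[i]) limbs[i / B] |= uint64_t(1) << (i % B);
    return limbs;
}

// Multiply the polynomials, then substitute x = 2^B by propagating carries.
std::vector<uint64_t> multiply(const std::vector<uint64_t>& a, const std::vector<uint64_t>& b) {
    std::vector<uint64_t> c(a.size() + b.size(), 0); // product coefficients
    for (std::size_t i = 0; i < a.size(); i++)       // this convolution is what
        for (std::size_t j = 0; j < b.size(); j++)   // FFT replaces in practice
            c[i + j] += a[i] * b[j];
    uint64_t carry = 0;                              // substitute x = 2^B
    for (std::size_t i = 0; i < c.size(); i++) {
        uint64_t cur = c[i] + carry;
        c[i] = cur & ((uint64_t(1) << B) - 1);
        carry = cur >> B;
    }
    return c; // B-bit limbs of the product, least significant first
}
```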
$$$B$$$ is a variable here -- the algorithm works for every $$$B$$$, but larger $$$B$$$s are faster. We're limited by the precision of 'double', though, because 'double' can only store integers from $$$0$$$ to $$$2^{53} \approx 9 \cdot 10^{15}$$$ exactly (e.g. $$$2^{53}+1$$$ cannot be represented by a double). Each coefficient of the product is a sum of up to $$$\frac{n}{B}$$$ terms, each below $$$2^{2B}$$$, so we certainly require $$$2B + \log_2 \frac{n}{B} \le 53$$$ just to be able to represent the product -- but the actual limit is not $$$53$$$ but a bit less, because of precision loss at intermediate stages. So there's an incentive to use $$$B$$$ as large as possible, but exactly how far we can push it is hard to predict because of the sheer complexity of floating-point error analysis.
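For a feel of the numbers, here's a tiny illustrative helper that finds the largest $$$B$$$ satisfying the representability condition alone, i.e. ignoring the additional round-off loss the rest of this article is about:

```cpp
// Purely illustrative: the largest B with 2B + log2(n / B) <= 53 for a total
// bit length n. This ignores any FFT round-off on top of representability.
#include <cmath>
#include <cstdio>

int naive_max_B(double n) {
    for (int B = 26; B >= 1; B--)                       // the condition is increasing in B,
        if (2 * B + std::log2(n / B) <= 53) return B;   // so the first hit is the maximum
    return 1;
}

int main() {
    for (double n : {1e5, 1e6, 1e7, 1e8})
        std::printf("n = %.0e bits -> B <= %d\n", n, naive_max_B(n));
}
```

For $$$n \approx 10^6$$$ bits this gives $$$B \le 18$$$ before any FFT round-off is even taken into account.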
So here's what we're gonna do: start with some large value of $$$B$$$, run a few polynomials through the algorithm and check whether the product is correct. If it isn't, decrease $$$B$$$ and repeat. If it is, call it a day and use this value of $$$B$$$ for the rest of your life. That's what I've always done. That's what I did when writing my own bigint library, except I wanted to be more thorough, so I stress-tested it for 20 hours in 8 threads on random input. Zero errors.
Twist
The problem is -- this is bullshit. The fact that the algorithm works for random data says nothing about how it behaves in the worst case, for one simple reason: for random data, the errors accumulated during the intermediate stages are more or less independent and behave like a one-dimensional random walk. The neat thing about a $$$k$$$-step 1D random walk is that, on average, the distance from $$$0$$$ is around $$$\sqrt{k}$$$, as opposed to the maximal possible value of $$$k$$$. So random testing severely underestimates the error.
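To see how large the gap is, here's a self-contained simulation (illustrative only) comparing the typical displacement of a $$$k$$$-step $$$\pm 1$$$ walk with the worst case:

```cpp
// Illustration of why random tests are too kind: the typical displacement of a
// k-step +-1 random walk grows like sqrt(k), while the worst case grows like k.
#include <cmath>
#include <cstdio>
#include <random>

int main() {
    std::mt19937_64 rng(1);
    const int k = 1 << 20, trials = 100;
    double sum_sq = 0;
    for (int t = 0; t < trials; t++) {
        long long pos = 0;
        for (int i = 0; i < k; i++) pos += (rng() & 1) ? 1 : -1;
        sum_sq += double(pos) * pos;
    }
    std::printf("k = %d: rms displacement ~ %.0f, sqrt(k) = %.0f, worst case = %d\n",
                k, std::sqrt(sum_sq / trials), std::sqrt(double(k)), k);
}
```

The root-mean-square displacement lands near $$$\sqrt{k} = 1024$$$, three orders of magnitude below the worst case $$$k$$$.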
What is the worst case for FFT, then?
When the data is random, errors in FFT of the input polynomials sort of cancel each other. If, however, the FFT of the input polynomials had a few large coefficients and lots of small coefficients, the error of the output would essentially be determined by the error of the large coefficients alone.
Say, let $$$P(x) = A \sum_j e^{2 \pi ij/n} x^j = A \sum_j (e^{2 \pi i/n} x)^j$$$ for some arbitrary $$$A$$$. This polynomial is $$$0$$$ at all roots of unity except $$$e^{-2 \pi i/n}$$$, where it's $$$n A$$$. This is a good candidate because the correctness of the algorithm then hinges on that single value $$$nA$$$ being computed precisely -- which is much harder than, say, computing the FFT of $$$P(x) = A \sum_j x^j$$$, where the large value sits at $$$x = 1$$$ and is obtained essentially by just adding numbers together.
We need integer coefficients, though, so we're gonna let $$$P(x) = \sum_j \lfloor A \cos(2 \pi j/n + \varphi) \rfloor x^j$$$ for some large integer $$$A$$$ and an arbitrary phase $$$\varphi$$$. This introduces a small disturbance, but that's fine if $$$A$$$ is sufficiently large. (Also, depending on the algorithm, you might not need to get rid of the imaginary components.) We also need the coefficients to be non-negative, so let $$$P(x) = \sum_j \lfloor A (\cos(2 \pi j/n + \varphi) + 1) \rfloor x^j$$$. This effectively only increases the FFT at one point (at $$$x = 1$$$), so now we have two large coefficients: at $$$1$$$ and at $$$e^{-2 \pi i/n}$$$. But in fact that's not just fine but great -- this might exacerbate the error through catastrophic cancellation. There'll also be zero-padding, but that's alright too.
You know what happened when I squared this polynomial for $$$\varphi = 0, 0.1, 0.2, \dots$$$? The assertions failed instantly, in under 5 seconds.
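For reference, here's my rendering of how such a test case can be generated; the choice of $$$A$$$ and the sizes in the usage comment are assumptions for a $$$B = 16$$$ setup, and the FFT multiplication under attack is whatever routine you want to test.

```cpp
// Illustrative generator for the adversarial input described above: coefficients
// floor(A * (cos(2*pi*j/n + phi) + 1)), whose FFT is concentrated in two large bins.
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<uint32_t> adversarial_poly(std::size_t n, uint32_t A, double phi) {
    const double pi = std::acos(-1.0);
    std::vector<uint32_t> coef(n);
    for (std::size_t j = 0; j < n; j++)
        coef[j] = (uint32_t)std::floor(A * (std::cos(2 * pi * j / n + phi) + 1.0));
    return coef;
}

// Usage sketch (A = (1 << 15) - 1 keeps coefficients below 2^16, matching B = 16):
//   for (double phi = 0.0; phi < 1.0; phi += 0.1) {
//       auto p = adversarial_poly(1 << 19, (1 << 15) - 1, phi);
//       // square p with your FFT routine and compare against a trusted method
//   }
```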
That's bad. Like, really bad. Everyone's been doing this, everyone's been teaching that this works. I've handwaved that "twiddle factors are pseudorandom, so the errors are pseudorandom too, so we've got good bounds" myself when teaching others.
Fallout
So. How do we deal with this?
Well, for one thing, it might be time to start looking through solutions using FFT more carefully when hacking and attempt to crack them this way. :-)
But seriously.
With this test case, my code for multiplication of $$$10^6$$$-digit (decimal) numbers, which worked for pretty much all inputs, suddenly could only deal with $$$6 \cdot 10^5$$$-digit numbers, which is not great. (I used $$$B = 16$$$.) Your code, utilizing polynomials with coefficients in the range $$$[0; 10^4)$$$, which you thought would work for $$$7 \cdot 10^6$$$-digit numbers, might actually only be good for $$$3 \cdot 10^6$$$-digit numbers.
Perhaps you could reduce $$$B$$$. But that's its own can of worms. For instance, if you're storing data in binary, $$$B = 16$$$ is nice because splitting 64-bit words into 16-bit words is trivial. $$$B = 8$$$ works too, but halves the performance, and that's bad. $$$B = 12$$$ is not as easy to work with, but four 12-bit words are three 16-bit words (48 bits either way), so base conversion is at least possible to perform efficiently.
If you're working in decimal, though? FFT coefficients up to $$$10^4$$$ match stored coefficients up to $$$10^{16}$$$ well (a single stored coefficient divides into exactly 4 FFT coefficients), and FFT coefficients up to $$$10^3$$$ match stored coefficients up to $$$10^{18}$$$ well (exactly 6 FFT coefficients each), but, again, the performance impact is non-negligible. And since most competitive programmers only write FFT for power-of-two sizes, the slow-down is exacerbated.
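As an illustration of the "divides exactly" remark, this is the kind of repacking helper involved (the names are mine):

```cpp
// Illustration of the repacking above: one stored base-10^16 limb splits exactly
// into four base-10^4 FFT coefficients, and a base-10^18 limb into six base-10^3 ones.
#include <array>
#include <cstdint>

std::array<uint32_t, 4> split_base1e16(uint64_t limb) {   // limb < 10^16
    std::array<uint32_t, 4> out;
    for (int i = 0; i < 4; i++) { out[i] = uint32_t(limb % 10000); limb /= 10000; }
    return out;                                           // least significant first
}

std::array<uint32_t, 6> split_base1e18(uint64_t limb) {   // limb < 10^18
    std::array<uint32_t, 6> out;
    for (int i = 0; i < 6; i++) { out[i] = uint32_t(limb % 1000); limb /= 1000; }
    return out;
}
```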
NTT? Don't get me started on NTT. NTT is terrible. NTT is so utterly terrible that y-cruncher prefers FFT as long as the data fits in memory: FFT is memory-bound while NTT is compute-bound, meaning FFT's performance depends more on your memory bandwidth than on raw processing power. If you know how to work with FFT, you'll have a generic bigint library faster than GMP in no time. You can fit data into both the real and imaginary parts to double the performance. You can use SIMD. You can abuse the IEEE 754 format to convert input and output more efficiently. As FFT is memory-bound, you can mix recursive FFT with iterative FFT to make it even faster, basically adapting to the cache topology at runtime. No way you're getting that far with NTT alone.
Solution
The simplest way to avoid the pitfall seems to be this.
After multiplying $$$P(x) \cdot Q(x) = R(x)$$$ with FFT, choose a random value $$$x_0 \in \mathbb{Z}_M$$$ and check whether $$$P(x_0) \cdot Q(x_0)$$$ equals $$$R(x_0)$$$ (modulo $$$M$$$). $$$M$$$ doesn't have to equal $$$2^B$$$ or match the modulus you're otherwise using, if any.
You can also use the same trick if you're multiplying integers as opposed to polynomials -- just perform the check before carry propagation. This is more efficient than comparing the resulting numbers mod $$$M$$$ for a random $$$M$$$.
Prefer $$$M = 2^{64} - 1$$$ over $$$M = 2^{64}$$$. The former is a bit less efficient (though certainly more efficient than any other non-trivial modulus), but provides better guarantees -- it catches even a miscomputation specially constructed to fool this check (which I'm not even sure is possible, but better safe than sorry) with probability $$$0.96$$$. If the check fails, switch to a slower but more reliable algorithm (e.g. $$$B = 12$$$ instead of $$$B = 16$$$, or Toom-3, or Karatsuba). The check increases the runtime by about $$$10\%$$$, so you might want to enable it only for large $$$n$$$ -- e.g., from 1 or 2 bits below the provided counter-example, to be safe.
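Here's a sketch of the whole check under the assumptions above: coefficients (for $$$R$$$, the rounded pre-carry convolution values) are passed as 64-bit integers, and all arithmetic is done modulo $$$M = 2^{64} - 1$$$.

```cpp
// Sketch of the verification step: evaluate P, Q and the FFT output R at a random
// point modulo M = 2^64 - 1 and compare. Coefficients are passed as uint64_t.
#include <cstdint>
#include <random>
#include <vector>

static uint64_t add_m(uint64_t a, uint64_t b) {          // a + b mod 2^64 - 1
    uint64_t r = a + b;
    if (r < a) r++;                                      // wrapped past 2^64: 2^64 == 1 (mod M)
    return r == UINT64_MAX ? 0 : r;                      // canonical form
}
static uint64_t mul_m(uint64_t a, uint64_t b) {          // a * b mod 2^64 - 1
    unsigned __int128 p = (unsigned __int128)a * b;
    return add_m((uint64_t)p, (uint64_t)(p >> 64));      // hi * 2^64 + lo == hi + lo (mod M)
}
static uint64_t eval(const std::vector<uint64_t>& c, uint64_t x0) {  // Horner, mod M
    uint64_t r = 0;
    for (std::size_t i = c.size(); i-- > 0; ) r = add_m(mul_m(r, x0), c[i]);
    return r;
}

// Returns true if the product looks consistent; on false, fall back to a slower method.
bool check_product(const std::vector<uint64_t>& P, const std::vector<uint64_t>& Q,
                   const std::vector<uint64_t>& R) {
    static std::mt19937_64 rng(std::random_device{}());
    uint64_t x0 = rng() % UINT64_MAX;                    // a random point in Z_M
    return mul_m(eval(P, x0), eval(Q, x0)) == eval(R, x0);
}
```

For integer multiplication, run this on the pre-carry coefficient arrays, as suggested above.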
Outro
I'm not providing my actual implementation here -- mostly because exactly how you implement the checks depends heavily on the structure and the exact algorithms of your bignum arithmetic, and my snippets have grown out of proportion compared to typical competitive programming libraries (which makes them much faster, but still). I'll probably write another article on that later.
The basic counter-example was provided in
Colin Percival, "Rapid multiplication modulo the sum and difference of highly composite numbers", Mathematics of Computation, Volume 72, Number 241, pages 387-395, 2002.
I adapted it to the use case of integer multiplication via FFT and provided the analysis and the workarounds.
This raises another question, actually. There's lots of knowledge out there, already present in an easily consumable way, and we... sort of ignore it? Might be something to ponder about.
KACTL's implementation of FFT has a proof that rounding is safe if $$$(\sum a_i^2 + \sum b_i^2)\log_2{N} < 9\cdot10^{14}$$$.
I don't agree with this. Not having any precision errors is wonderful. Also, NTT implemented properly with Montgomery multiplication (plus some other optimizations) is surprisingly fast. For example, see my submission on yosupo for the benchmark problem "convolution large": there I read two polynomials of degree $$$2^{24}$$$, multiply them mod 998244353, and print the answer, all in under 2 seconds. This is without using SIMD.
If you look around the different yosupo convolution benchmark problems, you will see that NTT is generally used by the fastest solutions. For example, the fastest solution for the mod $$$10^9 + 7$$$ convolution benchmark on yosupo uses NTT. It is not at all obvious which of FFT and NTT is faster. Furthermore, even if FFT turns out to be faster in some scenario, there is always the risk that it gives a wrong result for some input. NTT does not have this drawback.