This is an extension of my article proposing the "ultimate" NTT. Here I will present a simplified and optimized variant of the Barrett reduction algorithm on that page. As it relaxes certain bounds of the standard parameters of the algorithm, I will also present a proof of the correctness of the variant, at least for this specific case.
First, the algorithm:
function mulMod(a: Long, b: Long): Long {
xh = multiplyHighUnsigned(a, b) // high word of product
xl = a * b // low word of product
g = multiplyHighUnsigned( ((xh << 2) | (xl >>> 62)) , BARR_R)
t = xl - g * MOD - MOD
t += (t >> 63) & MOD
return t
}
Notice that the middle routine has been greatly simplified; instead of computing the high and middle words of $$$xr$$$, we simply take the top 64 bits of $$$x$$$, in effect computing $$$\displaystyle \left\lfloor \left\lfloor \frac{x}{2^{62}} \right\rfloor \frac{r}{2^{64}} \right\rfloor$$$ in place of $$$\displaystyle \left\lfloor \frac {xr}{4^{63}} \right\rfloor$$$. Experimentally, this seems to work (and indeed, this variant is how Barrett reduction is normally implemented for multi-word cases), however a naive adjustment of the bounds of the Barrett reduction algorithm would relax the pre-normalization bound of $$$t$$$ to $$$[-m, 2m)$$$. This is bad news, as there is a possibility of overflowing into the sign bit, thus forcing the use of a modulus one bit shorter (e.g. $$$4611686018326724609$$$, $$$g = 3$$$) so that we could easily correct the result with two normalization rounds.
So the following is a proof of correctness for this relaxed algorithm, at least for the very specific use case in the described NTT:
On the Project Nayuki article on Barrett reduction (note that I will use $$$n$$$ instead of $$$m$$$ for the modulus in accordance to the notation on that page), one important lemma is the inequality
$$$\displaystyle\frac{x}{n} - 1 < \frac{xr}{4^k} ≤ \frac{x}{n} \implies \frac{x}{n} - 2 < \left\lfloor \frac{x}{n} - 1 \right\rfloor ≤ \left\lfloor \frac{xr}{4^k} \right\rfloor ≤ \frac{x}{n}$$$
We shall show that this inequality holds even if $$$\dfrac{xr}{4^k}$$$ is replaced by $$$\dfrac{(x-\delta) r}{4^k}$$$, with $$$0 ≤ \delta < 2^{62}$$$, the bits that have been omitted.
First we need to obtain a tighter bound on $$$r$$$. Given that $$$r$$$, $$$k$$$, and $$$n$$$ are fixed in our algorithm, we can verify that $$$\dfrac {4^k} n - r < 2^{-9}$$$. Let this value be $$$\varepsilon$$$. We thus obtain $$$\displaystyle \frac{4^k}{n} - \varepsilon < r < \frac{4^k}{n}$$$
Multiply by $$$x-\delta \geq 0$$$: $$$\displaystyle (x-\delta)\left(\frac{4^k}{n} - \varepsilon\right) ≤ (x-\delta)r ≤ (x-\delta)\frac{4^k}{n}$$$
Given that $$$\delta$$$ is nonnegative we can relax the right bound to $$$x\dfrac{4^k}{n}$$$.
Divide by $$$4^k$$$: $$$\displaystyle (x-\delta)\left(\frac{1}{n} - \frac \varepsilon {4^k}\right) ≤ \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$
Recompose the leftmost expression: $$$\displaystyle \frac x n - \left(\frac {\varepsilon x} {4^k} + \frac \delta n\right) + \frac{\delta\varepsilon} {4^k} ≤ \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$
$$$\displaystyle\frac{\delta\varepsilon} {4^k} \geq 0$$$, so relax the bound: $$$\displaystyle \frac x n - \left(\frac {\varepsilon x} {4^k} + \frac \delta n\right) ≤ \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$
$$$x < n^2 < 4^k \implies \dfrac{x}{4^k} < 1$$$, so further relax the bound: $$$\displaystyle \frac x n - \left({\varepsilon} + \frac \delta n\right) < \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$
$$$\delta < 2^{62}$$$ while $$$n$$$ is very slightly below $$$2^{63}$$$, so it can be verified that $$$\dfrac \delta n$$$ must be $$$< \dfrac34$$$.
$$$\dfrac34 + \varepsilon < 1$$$, therefore $$$\displaystyle \frac x n - 1 < \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$
We can thus follow the rest of the proof in the Nayuki article to prove that our $$$t$$$ is in $$$[-n, n)$$$.
Samples (Kotlin, so YMMV on benchmarks for other languages)
74853038 for 993E - Никита и порядковая статистика, 328 ms faster
74852214 for 986D - Идеальное кодирование, 452 ms faster
Addendum
There are other applications of this proof besides NTT. For example, I found a safe prime $$$9223372036854771239$$$, which is a good property for a rolling-hash modulus to have, as any number in $$$[2, m-2]$$$ will have multiplicative order at least $$$\dfrac{m-1}{2}$$$. It seems that $$$\varepsilon$$$ and $$$\dfrac\delta n - \dfrac 1 2$$$ will both be rather small for moduli so close below a power of 2.