Given $$$k,m$$$ and $$$n$$$ numbers $$$a_i$$$, compute each $$$a_i\times k\bmod m$$$
I came up with a algorithm that is almost twice as fast as the compiler implementation (when $$$m$$$ is a constant), which can effectively speed up the NTT and some DPs.
First of all
where $$$\{x\}=x-\lfloor x\rfloor$$$ is the fractional part function. The principle of the above equation is that $$$a_i\times k\bmod m$$$ is only related to the fractional part of $$$\frac{a_i\times k}m$$$.
By analogy with barrett reduction, let $$$\frac km\approx\frac p{2^{64}}$$$, then
Here are two multiple choice questions.
- $$$\frac p{2^{64}}\le\frac km$$$ or $$$\frac p{2^{64}}\ge\frac km$$$.
- The last formula does not necessarily work out as an integer, to be rounded down or up.
We choose $$$\frac p{2^{64}}\ge\frac km$$$, with the final formula rounded down. The former will make the result biased large, the latter will make the result biased small, a perceptual understanding so that the result will be more accurate.
Theorem: Let $$$p=\lceil\frac km\times2^{64}\rceil$$$, when $$$a_i\le \frac{2^{64}}m$$$, the calculation result of the lower rounding is exact.
Proof: written outside
const int P = 998244353;
void calc(int n, int k, int a[]) {
unsigned long long p = ((unsigned __int128)k << 64) + P - 1) / P;
for(int i = 0; i < n; i++)
a[i] = (unsigned)a[i] * p * (unsigned __int128)P >> 64;
}
A few notes.
- The code uses
unsigned __int128
so it can only be used in a 64-bit architecture. (unsigned)a[i] * p
will automatically be modulo 2^64.* (unsigned __int128)P >> 64
is 64-bit multiplication retaining only the high 64 bits (in the rdx register), the same speed as multiplying twounsigned long long
s together.- Converting
a[i]
tounsigned
is becauseint
tounsigned
and then tounsigned long long
is faster than going directly tounsigned long long
, which requires sign extension.
Speed test.
code, contains two parts.
- The first part is the reciprocal throughput, the time taken by the CPU to be highly parallel (modern CPUs can be parallelized at instruction level on a single core), containing a total of $$$50000\times50000$$$ modulo multiplications.
- The second part is the Latency, which is the time taken for each modulo multiplication to be performed sequentially without parallelism, containing a total of $$$50000\times25000$$$ modulo multiplications.
Possible output:
Throughput test(50000 * 50000):
Compiler's signed modulo: 1954.83 ms
Compiler's unsigned modulo: 1746.73 ms
My modulo: 1160.47 ms
Latency test(50000 * 25000):
Compiler's signed modulo: 4329.33 ms
Compiler's unsigned modulo: 3945.29 ms
My modulo: 2397.97 ms
By the way, a few general facts.
int
tounsigned
then tolong long
is faster thanlong long
, but negative numbers will be wrong.unsigned long long
modulo constants is faster thanlong long
.- Constant modulo multiplication is almost 4 times faster in parallel than serial (as is modulo multiplication of my invention).
Extensions
It is also possible to compute $$$(a_1b_1+a_2b_2+\cdots+a_nb_n)\bmod m$$$, but $$$\sum a_i$$$ cannot exceed $$$\frac{2^{64}}m$$$.
Let $$$p_i=\lceil\frac{b_i}m\times2^{64}\rceil$$$.