Given $$$k,m$$$ and $$$n$$$ numbers $$$a_i$$$, compute each $$$a_i\times k\bmod m$$$
I came up with a method that is almost twice as fast as the compiler implementation (when $$$m$$$ is a constant), which can effectively speed up the NTT and some DPs.
First of all
where $$$\{x\}=x-\lfloor x\rfloor$$$ is the fractional part function. The principle of equation $$$(1)$$$ is that $$$a_i\times k\bmod m$$$ is only related to the fractional part of $$$\frac{a_i\times k}m$$$.
Let $$$\frac km\approx\frac p{2^{64}}$$$, $$$p$$$ is either equal to $$$\lfloor\frac km\times2^{64}\rfloor$$$ or $$$\lceil\frac km\times2^{64}\rceil$$$, which one to choose, we will decide later, then
There are two place we need to choose whether round up or down:
- $$$p=\lfloor\frac km\times2^{64}\rfloor$$$ or $$$p=\lceil\frac km\times2^{64}\rceil$$$.
- The final formula in (2) isn't always an integer, so we need to consider whether we round it up or down.
We choose $$$p=\lceil\frac km\times2^{64}\rceil$$$, which will slightly enlarge the answer, while the final formula in (2) rounded down, which will slightly lessen the answer. We'll prove that this set of choice can give us correct answer just if $$$a_i\le\frac{2^{64}}m$$$.
const int P = 998244353;
void calc(int n, int k, int a[]) {
unsigned long long p = ((unsigned __int128)k << 64) + P - 1) / P;
for(int i = 0; i < n; i++)
a[i] = (unsigned)a[i] * p * (unsigned __int128)P >> 64;
}
A few notes.
- The code uses
unsigned __int128
so it can only be used in a 64-bit architecture. (unsigned)a[i] * p
will automatically be modulo $$$2^{64}$$$.* (unsigned __int128)P >> 64
is 64-bit multiplication retaining only the high 64 bits (in the rdx register), the same speed as multiplying twounsigned long long
s together.- Converting
a[i]
tounsigned
is becauseint
tounsigned
and then tounsigned long long
is faster than going directly tounsigned long long
, which requires sign extension.
Speed test.
It contains two parts.
- The first part is the reciprocal throughput, the time taken by the CPU to be highly parallel (modern CPUs can be parallelized at instruction level on a single core), containing a total of $$$50000\times50000$$$ modulo multiplications.
- The second part is the Latency, which is the time taken for each modulo multiplication to be performed sequentially without parallelism, containing a total of $$$50000\times25000$$$ modulo multiplications.
Possible output:
Throughput test(50000 * 50000):
Compiler's signed modulo: 1954.83 ms
Compiler's unsigned modulo: 1746.73 ms
My modulo: 1160.47 ms
Latency test(50000 * 25000):
Compiler's signed modulo: 4329.33 ms
Compiler's unsigned modulo: 3945.29 ms
My modulo: 2397.97 ms
By the way, a few general facts.
- Constant modulo multiplication is almost 4 times faster in parallel than serial (as is modulo multiplication of my invention).
int
tounsigned
then tolong long
is faster thanlong long
, but negative numbers will be wrong.unsigned long long
modulo constants is faster thanlong long
.
Comparison with other methods
Comparison with Barrett reduction and Montgomery multiplication:
The purpose of my method is to compute $$$a\times b\bmod m$$$ for fixed $$$m$$$ and $$$b$$$, while Barrett reduction and Montgomery multiplication compute $$$a\times b\bmod m$$$ for fixed m. But my method is faster than the other two methods.
The derivation of my method is similar to Barrett reduction. So They both work when $$$m < 2^{32}$$$, while Montgomery multiplication works when $$$m < 2^{64}$$$ and $$$m$$$ is an odd number.
Extensions
It is also possible to compute $$$(a_1b_1+a_2b_2+\cdots+a_nb_n)\bmod m$$$, but $$$\sum a_i$$$ cannot exceed $$$\frac{2^{64}}m$$$.
Let $$$p_i=\lceil\frac{b_i}m\times2^{64}\rceil$$$.