A modulo multiplication algorithm that is 2x faster than compiler implementation

#	User	Rating
1	tourist	4009
2	jiangly	3823
3	Benq	3738
4	Radewoosh	3633
5	jqdai0815	3620
6	orzdevinwang	3529
7	ecnerwala	3446
8	Um_nik	3396
9	ksun48	3390
10	gamegame	3386

#	User	Contrib.
1	cry	167
2	Um_nik	163
3	maomao90	162
3	atcoder_official	162
5	adamant	159
6	-is-this-fft-	158
7	awoo	157
8	TheScrasse	154
9	Dominater069	153
9	nor	153

Given $$$k,m$$$ and $$$n$$$ numbers $$$a_i$$$, compute each $$$a_i\times k\bmod m$$$

I came up with a algorithm that is almost twice as fast as the compiler implementation (when $$$m$$$ is a constant), which can effectively speed up the NTT and some DPs.

First of all

$$$ a_i\times k\bmod m=\{a_i\times \frac km\}\times m $$$

where $$$\{x\}=x-\lfloor x\rfloor$$$ is the fractional part function. The principle of the above equation is that $$$a_i\times k\bmod m$$$ is only related to the fractional part of $$$\frac{a_i\times k}m$$$.

By analogy with barrett reduction, let $$$\frac km\approx\frac p{2^{64}}$$$, then

$$$ \begin{aligned} a_i\times k\bmod m&\approx\{a_i\times \frac p{2^{64}}\}\times m\\ &=\frac{a_ip\bmod 2^{64}}{2^{64}}\times m\\ &=\frac{a_ip\bmod 2^{64}\times m}{2^{64}} \end{aligned} $$$

Here are two multiple choice questions.

$$$\frac p{2^{64}}\le\frac km$$$ or $$$\frac p{2^{64}}\ge\frac km$$$.
The last formula does not necessarily work out as an integer, to be rounded down or up.

We choose $$$\frac p{2^{64}}\ge\frac km$$$, with the final formula rounded down. The former will make the result biased large, the latter will make the result biased small, a perceptual understanding so that the result will be more accurate.

Theorem: Let $$$p=\lceil\frac km\times2^{64}\rceil$$$, when $$$a_i\le \frac{2^{64}}m$$$, the calculation result of the lower rounding is exact.
Proof: written outside

const int P = 998244353;

void calc(int n, int k, int a[]) {
    unsigned long long p = ((unsigned __int128)k << 64) + P - 1) / P;
    for(int i = 0; i < n; i++)
    	a[i] = (unsigned)a[i] * p * (unsigned __int128)P >> 64;
}

A few notes.

The code uses unsigned __int128 so it can only be used in a 64-bit architecture.
(unsigned)a[i] * p will automatically be modulo 2^64.
* (unsigned __int128)P >> 64 is 64-bit multiplication retaining only the high 64 bits (in the rdx register), the same speed as multiplying two unsigned long longs together.
Converting a[i] to unsigned is because int to unsigned and then to unsigned long long is faster than going directly to unsigned long long, which requires sign extension.

Speed test.

code, contains two parts.

The first part is the reciprocal throughput, the time taken by the CPU to be highly parallel (modern CPUs can be parallelized at instruction level on a single core), containing a total of $$$50000\times50000$$$ modulo multiplications.
The second part is the Latency, which is the time taken for each modulo multiplication to be performed sequentially without parallelism, containing a total of $$$50000\times25000$$$ modulo multiplications.

Possible output:

Throughput test(50000 * 50000):
Compiler's signed modulo:   1954.83 ms
Compiler's unsigned modulo: 1746.73 ms
My modulo:                  1160.47 ms

Latency test(50000 * 25000):
Compiler's signed modulo:   4329.33 ms
Compiler's unsigned modulo: 3945.29 ms
My modulo:                  2397.97 ms

By the way, a few general facts.

int to unsigned then to long long is faster than long long, but negative numbers will be wrong.
unsigned long long modulo constants is faster than long long.
Constant modulo multiplication is almost 4 times faster in parallel than serial (as is modulo multiplication of my invention).

Extensions

It is also possible to compute $$$(a_1b_1+a_2b_2+\cdots+a_nb_n)\bmod m$$$, but $$$\sum a_i$$$ cannot exceed $$$\frac{2^{64}}m$$$.

Let $$$p_i=\lceil\frac{b_i}m\times2^{64}\rceil$$$.

$$$ (\sum a_ib_i)\bmod m=\lfloor\frac{(\sum a_ip_i)\bmod 2^{64}\times m}{2^{64}}\rfloor $$$

Rev.	By	When	Δ	Comment
en4	platelet	2023-01-18 16:20:20	158
en3	platelet	2023-01-18 04:55:17	9829
en2	platelet	2023-01-17 18:21:08	65
en1	platelet	2023-01-17 18:14:25	3760	Initial revision (published)

Speed test.

Extensions

History