Fast modular multiplication

#	User	Rating
1	jiangly	3898
2	tourist	3840
3	orzdevinwang	3706
4	ksun48	3691
5	jqdai0815	3682
6	ecnerwala	3525
7	gamegame	3477
8	Benq	3468
9	Ormlis	3381
10	maroonrk	3379

#	User	Contrib.
1	cry	168
2	-is-this-fft-	165
3	Dominater069	161
4	atcoder_official	160
5	Um_nik	159
6	djm03178	157
7	adamant	153
8	luogu_official	151
9	awoo	149
10	TheScrasse	146

I wrote an article/blog about how to do fast modular multiplication:

https://simonlindholm.github.io/files/bincoef.pdf

tl;dr:

avoid latency-bound loops
dynamic modulo is slow, constant modulo is fast
if you perform many multiplications with the same dynamic modulo, you can do what the compiler does and use Barrett reduction (involves some precomputation)
it is actually possible to beat the compiler if you accept a result in [0, 2*MOD):

uint64_t reduce(uint64_t a) {
  return a - (uint64_t)((__uint128_t(-1ULL / MOD) * a) >> 64) * MOD;
}

Same goes for addition and subtraction: if you can live with a result in [0, 2*MOD), just do a + b or a - b + MOD and skip the range correction that brings the result into [0, MOD). Delay modular reductions far as possible, ideally combining them with multiplications. While being mindful of overflows, of course.

on 32-bit, use Montgomery multiplication instead, to avoid __uint128_t
if really desperate, combine Montgomery multiplication with SIMD; this runs 3x faster than Barrett reduction when AVX2 is available
for larger numbers (e.g. multiplication of 64-bit numbers), use floating-point based methods