Modular inverse, Montgomery multiplication, and integer factorization

→ Pay attention

Before contest
Codeforces Round 1006 (Div. 3)
3 days
Register now »

→ Streams

The 2025 Universal Cup Finals

By tourist

Stream is running

Greedy Algorithms — Topic Stream

By Shayan

Before stream 07:19:42

View all →

→ Top rated

#	User	Rating
1	tourist	3856
2	jiangly	3747
3	orzdevinwang	3706
4	jqdai0815	3682
5	ksun48	3591
6	gamegame	3477
7	Benq	3468
8	Radewoosh	3462
9	ecnerwala	3451
10	heuristica	3431

Countries | Cities | Organizations

View all →

→ Top contributors

#	User	Contrib.
1	cry	167
2	-is-this-fft-	162
3	Dominater069	160
4	Um_nik	158
5	atcoder_official	156
6	Qingyu	155
7	djm03178	152
7	adamant	152
9	luogu_official	150
10	awoo	147

View all →

→ Find user

→ Recent actions

Detailed →

sslotin's blog

Modular inverse, Montgomery multiplication, and integer factorization

By sslotin, 3 years ago, In English

I've been working on computational number theory for my book and recently wrote a few new articles on:

calculating modular inverse using binary exponentiation and extended Euclidean algorithm (spoiler: Euclid wins on average);
Montgomery multiplication and fast modulo / division / divisibility checks for when the divisor is constant (which can often be used to speed up modular arithmetic by 5-15x);
how to use them in various factorization algorithms (upd: fixed link) (writing a 3x-faster-than-state-of-the-art implementation of Pollard's algorithm in the process).

This is largely inspired by Jakube's tutorials and Nyaan' library implementations. Thank you!

If someone is interested, I may or may not try to optimize the sieve of Eratosthenes next: I believe it is possible to find all primes among the first $$$10^9$$$ numbers in well under a second.

sslotin
3 years ago
8

Comments (8)

Write comment?

Golovanov399

3 years ago, # |

+11

Thank you!

First of all, your factorizations link leads to an incorrect page. Second, you write there that for integers up to $$$2^{64}$$$ it is better to use Pollard's rho, but you are aware of some advanced factorization algorithms, as you mention some of them for larger $$$n$$$. Does it mean that your implementation is faster than, for example, some implementations of SQUFOF? (here in C++ or here in C)

→ Reply

sslotin

3 years ago, # ^ |

I filtered out SQUFOF early on because it has the same asymptotic complexity but looks much more complex and arithmetic-intensive.

I added dacin21's implementation to the benchmark, and it measures 425 factorizations per second, which is 7 times slower than Pollard-Brent with Montgomery multiplication. It can probably be optimized, but not by an order of magnitude.

→ Reply

dacin21

3 years ago, # ^ |

The main advantage of SQUFOF was that it mostly uses 32 bit integers. Back in the days, codeforces was still using 32 bit machines, so this was great. On modern 64-bit machines however, SQUFOF greatly suffers from division operations by changing values, which rule out the typical Barrett / Montgomery tricks, and having to compute square roots.

In all honesty, I'm kinda glad I get to retire my SQUFOF code, as "fast" was just about the only positive thing I could say about it.

PS: how fast is your code compared to tinyecm.c?

→ Reply

oversolver

3 years ago, # |

Finally we can remove this blog from catalog!

→ Reply

TLE

3 years ago, # |

+24

If you're interested, https://loj.ac/p/6466 is a testbed for integer factorization ($$$n\le 10^{30}$$$). There are public implementations of quadratic sieve, elliptic-curve factorization and some pollard rho's :)

→ Reply

clyring

3 years ago, # |

← Rev. 4 →

I'm a bit surprised that you found a division-based implementation of the extended Euclid algorithm typically faster than exponentiation-by-squaring for finding an inverse mod $$$10^9 + 7$$$ on your setup, even knowing that general 32-bit divmod isn't that slow on most hardware. But it's close and I expect with more optimizations, exponentiation-by-squaring can retake the lead. The modmuls themselves can be made perhaps 10-20% faster using relaxed Barrett reduction or the variant Montgomery reduction I discuss below, and (if you want) there are several ways to reach $$$10^9 + 5$$$ with 37 modmuls instead of 43.

long Barrett reduction comments

There's also a slightly simpler and faster relative of Montgomery reduction that uses wide multiplications. (So it sacrifices the traditional main advantage of Montgomery reduction, in exchange for being as simple and fast as Lemire reduction.) The idea is simple: wideMul(x * pinv, p) is clearly a multiple of p and its lower word is equal to x. So, its upper word must be $$$-2^{-64} \cdot x$$$.

It's my understanding that the highest-precision multiplication operands typically included in SIMD instruction sets are doubles, providing only 52+1 bits of precision. So I'm curious about your plans for using SIMD to further speed up factorization of 60-bit integers.

I tried my own hand at a somewhat optimized sieve of Eratosthenes in January. The basic optimizations of segmenting the sieve into blocks small enough to fit into the L2 cache, and skipping over all even numbers plus a dumb Haskell-specific workaround/optimization or two were enough to be able to generate and traverse the list of all primes up to a billion in about 1.7 seconds on the Codeforces servers. I expect just applying the wheel-sieve idea with primes 3 and 5 as well, and using a language where the expected output isn't 2GB-large lazy linked list would get under a second with a little room to spare.

→ Reply

lmn207

2 years ago, # |

nice rick roll =))

→ Reply

SPyofgame

2 years ago, # |

I tried to dig some research into papers and github projects before, and it seems to be somewhat possible. You can use look-up table on a Bitwise Range Segment Erastothenes Sieve with several modifications and optimizations using bit manipulation tricks. AFAIK there were some codes that can actually count prime under $$$10^9$$$ in one second using this way (but can not store them that fast, and not worth it since there are already better algorithms for counting).

→ Reply