I've been working on computational number theory for my book and recently wrote a few new articles on:
- calculating modular inverse using binary exponentiation and extended Euclidean algorithm (spoiler: Euclid wins on average);
- Montgomery multiplication and fast modulo / division / divisibility checks for when the divisor is constant (which can often be used to speed up modular arithmetic by 5-15x);
- how to use them in various factorization algorithms (upd: fixed link) (writing a 3x-faster-than-state-of-the-art implementation of Pollard's algorithm in the process).
This is largely inspired by Jakube's tutorials and Nyaan' library implementations. Thank you!
If someone is interested, I may or may not try to optimize the sieve of Eratosthenes next: I believe it is possible to find all primes among the first $$$10^9$$$ numbers in well under a second.
Thank you!
First of all, your factorizations link leads to an incorrect page. Second, you write there that for integers up to $$$2^{64}$$$ it is better to use Pollard's rho, but you are aware of some advanced factorization algorithms, as you mention some of them for larger $$$n$$$. Does it mean that your implementation is faster than, for example, some implementations of SQUFOF? (here in C++ or here in C)
I filtered out SQUFOF early on because it has the same asymptotic complexity but looks much more complex and arithmetic-intensive.
I added dacin21's implementation to the benchmark, and it measures 425 factorizations per second, which is 7 times slower than Pollard-Brent with Montgomery multiplication. It can probably be optimized, but not by an order of magnitude.
The main advantage of SQUFOF was that it mostly uses 32 bit integers. Back in the days, codeforces was still using 32 bit machines, so this was great. On modern 64-bit machines however, SQUFOF greatly suffers from division operations by changing values, which rule out the typical Barrett / Montgomery tricks, and having to compute square roots.
In all honesty, I'm kinda glad I get to retire my SQUFOF code, as "fast" was just about the only positive thing I could say about it.
PS: how fast is your code compared to
tinyecm.c
?Finally we can remove this blog from catalog!
If you're interested, https://loj.ac/p/6466 is a testbed for integer factorization ($$$n\le 10^{30}$$$). There are public implementations of quadratic sieve, elliptic-curve factorization and some pollard rho's :)
I'm a bit surprised that you found a division-based implementation of the extended Euclid algorithm typically faster than exponentiation-by-squaring for finding an inverse mod $$$10^9 + 7$$$ on your setup, even knowing that general 32-bit divmod isn't that slow on most hardware. But it's close and I expect with more optimizations, exponentiation-by-squaring can retake the lead. The modmuls themselves can be made perhaps 10-20% faster using relaxed Barrett reduction or the variant Montgomery reduction I discuss below, and (if you want) there are several ways to reach $$$10^9 + 5$$$ with 37 modmuls instead of 43.
Your estimate for $$$s$$$ with Barrett reduction is wrong: The range from $$$\frac{2^n-2}{2^n-1}-1$$$ to $$$\frac{2^n-2}{2^n-1}+\frac{2^n-2}{2^s}$$$ always contains the integer $$$0$$$ corresponding to the correct result of the division, but will contain the integer $$$1$$$ as well unless $$$2^s \geq x \cdot y$$$, which is not implied by $$$s \geq n$$$.
Estimating a bit more carefully, the error in $$$m$$$ is $$$\frac{-2^s \mod y}{2^s}$$$, so that $$$2^s \geq x \cdot (-2^s \mod y)$$$ is a sufficient condition for the division to yield the exact correct result. When you type
(u64)x % 1000000007
, gcc verifies this condition for $$$s = 64+29$$$ and since $$$2^{29} < 1000000007$$$, it is able to perform the division with a multiply and a shift, and then calculates the remainder with a multiply and a subtract. (This is the most common case.) For some other divisors (most famously including 7), there is no 64-bit magic constant that is correct for every 64-bit dividend, so gcc emits a longer instruction sequence effectively performing 64x65-bit multiplication.However, the off-by-one produced by $$$s=n$$$ is often good enough for intermediate modular-arithmetic calculations. The preferred method here is to set $$$s=64$$$ and use $$$m = \lfloor \frac{2^s}{y} \rfloor$$$ so that the quotient risks being too small rather than too large, and the remainder will be in $$$[0, 2y)$$$. I've heard this variant called "relaxed Barrett reduction." For happy-path divisors it saves a shift, and for unhappy-path divisors it saves more.
It's worth noting that negative signed inputs are a real headache for most fast-divmod techniques. So even when using only the compiler's built-in strength reduction it is usually a meaningful speed and code size improvement to use unsigned types or some variant of
assume(x >= 0)
.There's also a slightly simpler and faster relative of Montgomery reduction that uses wide multiplications. (So it sacrifices the traditional main advantage of Montgomery reduction, in exchange for being as simple and fast as Lemire reduction.) The idea is simple:
wideMul(x * pinv, p)
is clearly a multiple ofp
and its lower word is equal tox
. So, its upper word must be $$$-2^{-64} \cdot x$$$.It's my understanding that the highest-precision multiplication operands typically included in SIMD instruction sets are doubles, providing only 52+1 bits of precision. So I'm curious about your plans for using SIMD to further speed up factorization of 60-bit integers.
I tried my own hand at a somewhat optimized sieve of Eratosthenes in January. The basic optimizations of segmenting the sieve into blocks small enough to fit into the L2 cache, and skipping over all even numbers plus a dumb Haskell-specific workaround/optimization or two were enough to be able to generate and traverse the list of all primes up to a billion in about 1.7 seconds on the Codeforces servers. I expect just applying the wheel-sieve idea with primes 3 and 5 as well, and using a language where the expected output isn't 2GB-large lazy linked list would get under a second with a little room to spare.
nice rick roll =))
I tried to dig some research into papers and github projects before, and it seems to be somewhat possible. You can use look-up table on a Bitwise Range Segment Erastothenes Sieve with several modifications and optimizations using bit manipulation tricks. AFAIK there were some codes that can actually count prime under $$$10^9$$$ in one second using this way (but can not store them that fast, and not worth it since there are already better algorithms for counting).