This blog focuses on the GCC compiler.
The most basic way to find a prime factorization for an integer $$$n$$$ is to test every possible divisor up to $$$\sqrt{n}$$$. This can be optimized by precomputing primes and only testing those. What else can we do? Note that the modulo operation is rather expensive. To optimize it, we can use something like the Montgomery multiplication, but it's too complicated for an "easy way". But also note that when the divisor is a compile-time constant, the compiler optimizes the modulo operation (or division, for that matter) to cheap multiplication and bit-shift operations. Let's take advantage of that:
#include <bits/stdc++.h>
using namespace std;
typedef uint32_t u32;
// The cold attribute tells the compiler that this function is unlikely to be
// called. Without it, compilation will take much more time because the
// compiler tries to optimize the function calls. Attributes are a GCC
// extension.
__attribute__((cold))
void factor_helper(vector<u32> &vec, u32 &x, u32 y)
{
do {
vec.push_back(y);
x /= y;
} while (x % y == 0);
}
vector<u32> factor(u32 x) {
vector<u32> vec;
#define F(y) if (x%y == 0) { factor_helper(vec, x, y); }
F(2)F(3)F(5)F(7)F(11)F(13)... /* all primes up to sqrt(1e9) */
#undef F
if (x > 1)
vec.push_back(x);
return vec;
}
As you can see, it's pretty simple and the long line containing all primes can be generated with a few lines of code. The only problem that remains is its long compile time. The above code is tweaked in a way so that the compilation can be done within a few seconds with GCC (which has been achieved through trial and error, like how the parameters are passed, the cold attribute, etc.), so that's no longer an issue either. Also unsigned integers are used because they are faster in this case.
But how fast actually is this? With my tests it can factorize $$$10^6$$$ integers in less than 2 seconds! Feel free to test it for yourself (Be sure to use the latest version of g++ available on Codeforces for testing or actually using this).
That sounds really crazy...
Including all required primes. I will Ctrl-C + Ctrl-V going forward.
Same here.
Here you go.
You can do without pasting enormous sequence of ints...
Runs in 2199s in custom run... Had to split in chunks of 800, because CF has limit of 900 on recursive templates.
Wow that's nice! Guess I should learn more about constexpr and templates. Although the original version runs in 1341ms...
I think the difference in execution time might be due to different bounds. I tried out the suggestion by ToxicPie9 below and adjusted the bounds:
Runs 1419ms now.
You can replace
fact_p<L+1, R>
withfact_p<L, (L+R)/2>
andfact_p<(L+R)/2, R>
to get $$$\log(P)$$$ depth so it fits into the limit of 900.constexpr + template are all you need
Here is my final version using C++ templates and constexpr (based on adamant's code here), so you don't have to paste 4000 prime numbers:
With standard optimizations (
-O2
), its performance matches the code in the blog. It uses templates and thealways_inline
attribute to generate code like a macro.It took a lot of modifications to get a working one, and I needed some time to analyze why some versions are a lot slower than other ones. Below are some interesting technical facts, reader discretion is advised.
The key here is the
noinline
attribute (cold
might also work), which preventsfactor_helper
from being inlined. Without it, not only would compiling time increase a lot, the runtime also increases by about 6 times!When
factor_helper
is not inlined, The compiled code has a structure like this:Here only
x % p == 0
have optimized operations, but that's ok because divisions infactor_helper
only happen $$$O(\log(x))$$$ times.On the other hand, when
factor_helper
is inlined, The compiled code has a structure like this:When
push_back
is also inlined, this results in a enormous (up to 470 kB)factor
function, which is so large that cache misses start to have a significant effect. In adamant's case where he manually inlinedfactor_helper
, cache and branch misses caused the same amount of instructions to take 5.5x time to run. Took us a while to figure out what was going on :|#pragma GCC unroll
is faster to compile than C++ templates.Wow I didn't know unroll can be used like this, thank you for the info. It also makes the code a lot shorter and cleaner.
Have you checked it in CF custom test?
I get compilation time out on G++20 and G++17 (64 bit), and on G++17 the runtime is 7878ms...
I didn't get a compile timeout in the custom test, even though it took a long time to compile.
Ok, I found the problem.
__attribute__((noinline))
must be nearfactor_helper
for this to compile:This runs in 1357ms.
Actually, you can do this even faster, b/c you don't need to check primes $$$p*p>x$$$. This is ~3-5 times faster than your solution. (using recursive templates)