I was upsolving AtCoder Beginner Contest 195 problem F, aka "Coprime Present" where you need to count subsets (of a special set of size < 72) with pairwise coprime elements.
My solution is a bitmask dp over prime factors that were used and a prefix of the set. My first submission.
Fairly straightforward, nothing smart, $$$O(N \cdot 2^n)$$$ time and memory. However, I was disappointed by the memory usage of 600+ Mb. Therefore, I implemented a two-layer version of the same dp. Here is my second submission.
The difference is minimal, I just take layer_number & 1
, and don't forget to clear the previous layer before proceeding to the next one
Obviously, it uses way less memory, only 30 Mb. What surprised me is that the running time has also decreased, from 472 ms to just 147 ms, for an improvement by a factor of 3!
I have two hypotheses on why this happens:
two layers fit into the better cache level, facilitating memory lookups;
memory allocation itself takes some time (there must be a reason why people write custom allocators, right?).
Can someone more experienced please tell me what actually happens there?