I was upsolving AtCoder Beginner Contest 195 problem F, aka "Coprime Present", where you need to count the subsets (of a special set of size < 72) with pairwise coprime elements.
My solution is a bitmask dp over the set of prime factors used so far and a prefix of the set. Here is my first submission.
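For context, here is a minimal sketch of that dp, my own reconstruction rather than the actual submitted code: since the numbers lie in a window of width under 72, any prime shared by two distinct elements is below 72, so only the 20 primes up to 71 matter, and the state is (prefix length, set of small primes already used).

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    // Any prime shared by two distinct numbers in [A, B] divides their
    // difference, so it is below 72; only these 20 primes matter.
    const vector<long long> primes = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
                                      31, 37, 41, 43, 47, 53, 59, 61, 67, 71};
    const int P = primes.size();

    long long A, B;
    cin >> A >> B;
    const int N = B - A + 1;

    // mask[i] = set of small primes dividing A + i
    vector<int> mask(N, 0);
    for (int i = 0; i < N; i++)
        for (int p = 0; p < P; p++)
            if ((A + i) % primes[p] == 0) mask[i] |= 1 << p;

    // dp[i][used] = number of subsets of the first i elements whose combined
    // small-prime set is exactly `used`; the full table is (N + 1) * 2^20
    // entries of 8 bytes, i.e. roughly 600 MB.
    vector<vector<long long>> dp(N + 1, vector<long long>(1 << P, 0));
    dp[0][0] = 1;
    for (int i = 0; i < N; i++)
        for (int used = 0; used < (1 << P); used++) {
            if (dp[i][used] == 0) continue;
            dp[i + 1][used] += dp[i][used];                // skip element A + i
            if ((used & mask[i]) == 0)                     // take it: no shared prime
                dp[i + 1][used | mask[i]] += dp[i][used];
        }

    long long ans = 0;
    for (int used = 0; used < (1 << P); used++) ans += dp[N][used];
    cout << ans << '\n';
}
```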
Fairly straightforward, nothing smart, $$$O(N \cdot 2^n)$$$ time and memory. However, I was disappointed by the memory usage of 600+ MB. Therefore, I implemented a two-layer version of the same dp. Here is my second submission.
The difference is minimal: I just index layers by layer_number & 1, and don't forget to clear the previous layer before proceeding to the next one.
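In sketch form (reusing primes, mask, N and P from the snippet above; again my guess at the shape of the change, not the exact code), the two-layer version replaces the full table with two rows:

```cpp
// Same transitions, but only two layers of 2^P entries each (~16 MB of dp data):
// dp[i & 1] holds layer i, dp[(i + 1) & 1] is reused for layer i + 1.
vector<vector<long long>> dp(2, vector<long long>(1 << P, 0));
dp[0][0] = 1;
for (int i = 0; i < N; i++) {
    auto &cur = dp[i & 1];
    auto &nxt = dp[(i + 1) & 1];
    fill(nxt.begin(), nxt.end(), 0);        // clear the stale previous layer
    for (int used = 0; used < (1 << P); used++) {
        if (cur[used] == 0) continue;
        nxt[used] += cur[used];
        if ((used & mask[i]) == 0) nxt[used | mask[i]] += cur[used];
    }
}
long long ans = 0;
for (int used = 0; used < (1 << P); used++) ans += dp[N & 1][used];
```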
Obviously, it uses way less memory, only 30 MB. What surprised me is that the running time has also decreased, from 472 ms to just 147 ms, for an improvement by a factor of 3!
I have two hypotheses on why this happens:
two layers fit into a faster cache level, speeding up memory lookups;
memory allocation itself takes some time (there must be a reason why people write custom allocators, right?).
Can someone more experienced please tell me what actually happens there?
Memory allocation
Yeah, 30 MB is too much to fit into any cache level, so it's purely a memory-allocation-related speedup.
Good one :)
BTW, L4 cache exists, but that's another story :)
It's not about cache, it's about memory allocation. It's obvious that one layer wouldn't fit into cache, so there is no difference in that regard.
UPD: I didn't understand your message at first, but it seems that you meant the same thing as me :) Hope this comment will clarify our statements.
I just edited my comment to make it more clear :)
I don't know how accurate it is, but "sudo perf record -e cache-misses solution" shows 5 times more cache misses for the first solution than for the second on my machine (these counters may work differently on different CPUs, and obviously different CPUs may have different caches).
PS. Hm, the cache-miss rate is almost the same, 75% for both solutions, because the second solution has not only 5 times fewer cache misses but also 5 times fewer cache references.
The problem is not memory allocation itself, but rather high memory utilization. So it seems to me that it is actually closer to a caching issue than to expensive allocations.
Two submission examples:
First — a huge amount of memory is allocated once per run, but since it stays unused, the running time is low (17 ms on the smallest test).
Second — a huge amount of memory is allocated too, but gradually, and previous dp layers are freed as we go. So at any moment only two layers are allocated, and the maximum running time is the same as in the two-layer solution (170 ms), but the total allocated memory is still huge.
I think I just don't understand how this works: why is allocating huge memory and not using it fast? Godbolt shows that the allocation still happens, so there is no compiler help on that side.
Maybe in the second example the compiler optimized something too? However, I'm not sure that's possible based on my knowledge. On the other hand, maybe the system just hands back the same memory block on each allocation, so it's indeed equivalent to the two-layer dp?
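To make the second experiment concrete, I assume its allocation pattern looks roughly like this (sizes and names are made up; the point is N same-sized requests with at most two blocks alive at a time):

```cpp
#include <cstddef>
#include <cstdint>

int main() {
    const std::size_t LAYER = 1 << 20;                 // one dp layer: 8 MB of int64_t
    const int N = 72;                                  // number of dp layers, assumed
    std::int64_t *prev = new std::int64_t[LAYER]();    // zero-initialized layer 0
    prev[0] = 1;
    for (int i = 0; i < N; i++) {
        std::int64_t *cur = new std::int64_t[LAYER](); // fresh layer i + 1
        // ... compute layer i + 1 from layer i ...
        delete[] prev;                                 // previous layer no longer needed
        prev = cur;
    }
    // ... read the answer from the last layer ...
    delete[] prev;
}
```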
C++ new int64_t[...] and delete operate on heap memory. Since the requested memory blocks are all of the same size, it indeed works almost exactly like a small fixed pool of arrays. As for the difference in time, it is simply due to the difference in the number of active memory pages: the more pages your process touches, the more frequently page faults happen. Allocating a huge unused array is fast because actual page allocation happens lazily.
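If it helps, here is a small standalone experiment (assuming typical Linux/glibc behaviour) showing that the cost sits in the first touch of each page rather than in new itself:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    // ~600 MB worth of int64_t, comparable to the full dp table.
    const std::size_t n = 600ull * 1024 * 1024 / sizeof(std::int64_t);

    auto t0 = std::chrono::steady_clock::now();
    std::int64_t *a = new std::int64_t[n];        // no value-initialization: almost free
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; i += 4096 / sizeof(std::int64_t))
        a[i] = 1;                                 // touch one element per 4 KB page
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("allocate: %lld ms, first touch: %lld ms\n",
                (long long)ms(t1 - t0), (long long)ms(t2 - t1));
    std::printf("a[0] = %lld\n", (long long)a[0]); // keep the stores observable
    delete[] a;
}
```

The expectation, matching the explanation above, is that the first number is close to zero and essentially all of the time shows up in the touching loop.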
So, the biggest offenders are not the page faults per se, but the initializations of new pages?
Obviously, it uses way less memory, only 30 MB. What surprised me is that the running time has also decreased, from 472 ms to just 147 ms, for an improvement by a factor of 3!
Improved by a factor of 3, not 3!
It's an exclamation mark, not a factorial.
upd: deleted
That's more important when you have a huge number of allocations (which is not the same as a large total amount of allocated memory), for example allocations of individual tree nodes.
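A common workaround for that case is a bump ("arena") allocator: grab one big block up front and hand out nodes by advancing an index. A minimal sketch, with a made-up Node layout and pool size:

```cpp
#include <cstddef>

// One upfront array; each new node just advances an index, so millions of node
// "allocations" cost almost nothing and everything is released at once at exit.
struct Node {
    int key;
    Node *left, *right;
};

const std::size_t POOL_SIZE = 1'000'000;     // enough nodes for this sketch
Node pool[POOL_SIZE];
std::size_t pool_used = 0;

Node *new_node(int key) {
    Node *p = &pool[pool_used++];            // no heap allocation, no free
    p->key = key;
    p->left = p->right = nullptr;
    return p;
}

int main() {
    Node *root = new_node(42);
    root->left = new_node(17);
    (void)root;
}
```

The trade-off is that nothing is ever freed individually, which is usually acceptable in contest code.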