I know the post is a bit long, but I think the whole story is worth knowing.
I attempted to solve div1D from round #623 (VK Cup) with a randomized solution (which is one of the official solutions). After writing the code, I wondered how many iterations of this randomized algorithm I could do and still fit in the TL. So I generated a maxtest (that is, n=80 and k=10 in this task) and ran it in custom invocation. I saw that I could do 60 000 iterations in about 2.4-2.5 seconds. Good enough, right? After all, the more iterations, the smaller the chance of being hacked / getting WA (even though this number of iterations is ridiculous). So I submitted the code (and here is a bit prettier code without templates/defines).
As you can see, I got TLE. At first thought, there could be multiple reasons for it, because it is a randomized algorithm. But the thing is, my code had a constant seed and a constant number of iterations. It also doesn't depend on the numbers in the input (other than n and k), because I'm only adding numbers, taking their minimums and printing a single number as the output. That means the code runs in exactly the same way for every input with n=80 and k=10 (excluding the really small time needed to read the numbers). That is why I thought that if my solution fits in the time limit in custom invocation, it would safely pass the system tests.
After the contest, I looked at the runtimes of my solution on the system tests (I had to lower the number of iterations, for example by a factor of 2, but then I scaled the numbers given here back up). Exactly 100 tests with n=80 and k=10 had a runtime of 2.8-3.1s, four had 2.2-2.6s (one as low as 2.34s), and one ran at a devilish 5.458s: test #35 (I had to spend some effort to retrieve this test). I ran this test locally and, compared to some randomly generated tests, test #35 seemed... to have one of the smallest runtimes?? And a few of the tests I generated had suspiciously high runtimes locally (1.5x slower than other tests), but in custom invocation they ran at a standard 2.8-2.9s.
I contacted someone from the Codeforces team about this issue, though he didn't reply. So I contacted someone that I know is good at C++. After a short while, we were using a profiling tool (perf, to be precise) to check the instruction count for some given tests (i.e. we read the hardware counters).
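For reference, the command we used was along the lines of `perf stat -e instructions,cycles,branches,branch-misses ./solution < maxtest.txt` (the binary and test file names here are placeholders); it prints the hardware counters I talk about below after the program finishes.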
The culprit
The number of (assembly) instructions was very close, so that couldn't be the problem. What really stood out were branch prediction and branch misses. You can also see the difference in instructions per cycle ("insn per cycle"; number of cycles $$$\approx$$$ CPU time). Logically, the number of instructions per cycle should be below 1, but because of some voodoo processor parallelization shit (actually because of speculative execution, e.g. branch prediction, pipelining, and certainly something else too) it is above 1, and here it differs greatly (so the number of instructions is roughly the same, but they were executed more slowly). I really recommend the short introduction to branch prediction on Stack Overflow to understand the topic I'm writing about.
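To get a feel for how big the effect can be, here is a minimal standalone sketch (my own, modeled on the setup in that Stack Overflow question, not the contest code): the same branchy loop over the same data, first unsorted and then sorted. At low optimization levels the sorted run is typically several times faster only because the branch becomes predictable; with aggressive optimization the compiler may remove the branch entirely and hide the effect.

```cpp
// Sketch of the classic branch prediction experiment (assumed example, not
// the contest code): on unsorted data the if below is taken essentially at
// random, so the predictor fails often; after sorting, the branch outcome is
// almost monotone and prediction is nearly perfect.
// Note: with -O3 the compiler may turn the branch into a conditional move
// and flatten the difference.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(12345);
    std::vector<int> data(1 << 24);
    for (int &x : data) x = rng() % 256;

    auto run = [&](const char *label) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int x : data)
            if (x >= 128) sum += x;          // the problematic branch
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%-8s sum=%lld time=%lld ms\n", label, sum, ms);
    };

    run("unsorted");
    std::sort(data.begin(), data.end());
    run("sorted");                            // same work, predictable branch
    return 0;
}
```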
The explanation
At the bottleneck of the algorithm (`dp[curr_k][end] = min(dp[curr_k][end], dp[curr_k - 1][start] + cost[start][end]);`), my solution uses addition and the min function, and that min function happened to be problematic. The average guy (and the C++ standard library) writes this function as `int min(int a, int b) { return a > b ? b : a; }`, and that uses one if statement (a.k.a. a branch). The CPU is able to partially hide the slowness of this if statement thanks to branch prediction, but branch prediction is not always right, and the more often it is right, the faster the execution. As you can see in the perf results, on some particular tests (usually ones with a lot of random numbers), the CPU is worse at predicting which side of the if statement is going to be executed. Because of mispredictions at this particular place (which is executed the most during runtime), my code ran over two times slower on particular tests.
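One workaround (a commenter below tried exactly this) is to make min branchless. Here is a sketch of the usual bit-trick version; it assumes that a - b doesn't overflow and that right-shifting a negative int is an arithmetic shift, which holds on the usual compilers but is not guaranteed by the standard:

```cpp
// Branchless min via bit tricks: when a < b, (a - b) >> 31 is all ones, so we
// add (a - b) to b and get a; otherwise the mask is zero and we keep b.
// Assumes 32-bit int, no overflow in a - b, and arithmetic right shift.
inline int branchless_min(int a, int b) {
    int d = a - b;
    return b + (d & (d >> 31));
}
```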
Strangely, how well the branches are predicted depends on how the code is compiled (for example, the order in which the instructions are emitted seems to matter). That (somehow) explains why -O3 is faster than -O2 while -O2 is slower than no optimization at all (??), why this code compiled with clang++ gives better results than with g++, why Codeforces' compiler (MinGW, probably?) fails on the mysterious test #35 while g++ nails it there but fails on some test cases where Codeforces is fine, and other strange, completely usual compiler stuff.
To me, it seems really magical that the difference in runtime is that gigantic. My intuition tells me that the runtimes should be the same. By trusting custom invocation and my intuition, I ended up with a 2280-ish rating instead of a 2390-ish one. Certainly other people have failed more miserably and more stupidly than me, but that doesn't change the fact that I should have known that branch prediction could have such an impact on the runtime. That's why I'm writing this blog: to teach or remind you that such problems exist.
Similar problems
There are a lot of Codeforces posts about caches, for example On Cache Friendliness or Tree queries for O(nq). The perf tool that I mentioned is good at detecting such issues.
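The standard example those posts revolve around is 2D array traversal order. Here is a minimal sketch of my own (illustrative, not taken from the linked posts), assuming a row-major layout as in C++: the two functions do the same work, but the second one jumps across rows and misses the cache far more often.

```cpp
// Cache-friendliness sketch. Each row of the matrix is contiguous in memory,
// so walking row by row streams through the cache; walking column by column
// is a strided access pattern that evicts cache lines before they are reused.
#include <cstddef>
#include <vector>

long long sum_row_major(const std::vector<std::vector<int>> &a) {
    long long s = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < a[i].size(); ++j)
            s += a[i][j];                    // consecutive addresses
    return s;
}

long long sum_col_major(const std::vector<std::vector<int>> &a) {
    long long s = 0;
    for (std::size_t j = 0; !a.empty() && j < a[0].size(); ++j)
        for (std::size_t i = 0; i < a.size(); ++i)
            s += a[i][j];                    // jumps a whole row each step
    return s;
}
```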
Also, pragmas exist, but I'm not really knowledgeable about them.
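For completeness, the pragmas usually meant here are GCC's optimization/target pragmas; the two lines below are the form I've seen most often in other people's submissions (whether they actually help depends on the compiler version and the judge's CPU, so treat this as a sketch, not a recommendation):

```cpp
// Commonly seen GCC pragmas; placed at the top of the file they ask the
// compiler to optimize harder and to use newer instruction sets.
#pragma GCC optimize("O3")
#pragma GCC target("avx2")
```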
A nice piece of advice is to use std::array or something array-ish instead of vectors with small sizes, because the emplace_backs take some time on small vectors. Or at least use vec.resize() when the vector size is known. Sorting small vectors is also slow; for example, it is much faster to manually sort a vector of size 3.
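As a toy illustration of that last point (my own sketch): three compare-and-swaps sort three values without any of the overhead of calling std::sort on a tiny container.

```cpp
// Manual sort of exactly three values: a small sorting network of three
// compare-and-swaps, much cheaper than a generic std::sort call.
#include <utility>

inline void sort3(int &a, int &b, int &c) {
    if (a > b) std::swap(a, b);
    if (b > c) std::swap(b, c);
    if (a > b) std::swap(a, b);   // now a <= b <= c
}
```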
Of course, when possible, you should use an iterative approach instead of a recursive one, and it is nice to understand why the recursive approach can be bad.
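A toy illustration (assumed example, not from the post): the same recurrence written both ways. The recursive version pays for a function call, a stack frame and a memo lookup per state, and deep recursion can also overflow the stack.

```cpp
// Recursive (memoized) vs. iterative computation of the same recurrence.
// The caller must do memo.assign(n + 1, -1) before calling fib_rec.
#include <cstdint>
#include <vector>

std::vector<std::int64_t> memo;

std::int64_t fib_rec(int n) {
    if (n <= 1) return n;
    if (memo[n] != -1) return memo[n];       // memo lookup + branch per call
    return memo[n] = fib_rec(n - 1) + fib_rec(n - 2);
}

std::int64_t fib_iter(int n) {
    std::int64_t a = 0, b = 1;               // plain loop: no calls, no lookups
    for (int i = 0; i < n; ++i) {
        std::int64_t c = a + b;
        a = b;
        b = c;
    }
    return a;
}
```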
Did I miss anything? Do you have some stories related to optimization/constant factor problems, or some advice?
Yeah. For some assembly code, the CPU can decide to execute instructions in a better order, for example. Then you have things like the branch delay slot: an instruction that doesn't affect the branch calculation can be executed while you're waiting for the result of the condition evaluation; whether such an instruction is executed, and which one it is, depends on the assembly code too. There are a lot of these small tricky things about exact execution; the most we can conclude is how to write assembly that the CPU likes.
It's really weird the compiler won't use branchless comparison/assignment instructions: https://godbolt.org/z/TzD7ei.
Your code with the -O3 flag forced by a pragma passes: 71847043, and interestingly enough, test 35 is solved noticeably quicker than the other maxtests. I guess the CF GCC compiler (I don't know what version it is) somehow decides to put branches in the code with the -O2 optimization flag, and removes them with -O3.
Well, that's funny /s. I tested my code with a branchless min function based on bit operations (here is the submission); it was better on test 35, though the overall time was even slower.