Topcoder Open 2011 started recently. This weekend was the first qualification round of the algorithm competition. I had an interesting experience during it, which I want to share.
The second problem was not that hard, though my solution seems to have been too naive for it. You can check out the statement here. I solved it with quite a naive approach, as follows:
#include <cstring>
#include <vector>
using namespace std;

double dp[80002][50];

class FoxListeningToMusic {
public:
    vector<double> getProbabilities(vector<int> length, int T) {
        memset(dp, 0, sizeof(dp));
        int n = length.size();
        // dp[i][k] = probability that song k is playing i seconds after a random song was just started
        for(int i = 0; i < n; i++)
            dp[0][i] = 1.0 / (double)n;
        double mul = 1.0 / (double)n;
        int idx;
        for(int i = 1; i <= T; i++) {
            for(int j = 0; j < n; j++) {
                idx = i - length[j];
                if(idx >= 0) {
                    // the chosen song j has already finished by second i, so the state is
                    // the same as idx = i - length[j] seconds after a fresh pick
                    for(int k = 0; k < n; k++)
                        dp[i][k] += mul * dp[idx][k];
                }
                else
                    dp[i][j] += mul; // song j is still playing at second i
            }
        }
        vector<double> v(n);
        for(int i = 0; i < n; i++)
            v[i] = dp[T][i];
        return v;
    }
};
It is not important whether the code solves the problem correctly, at least for what I am going to discuss. The fact is that I got time limit exceeded with this code. That was somewhat expected, as the complexity here is O(T * length.size()^2), which becomes 2 * 10^8 if we take the problem constraints into account. However, the interesting thing is that I tested it in the arena before submitting. The case I used seemed to be the "worst case" for my solution: 50 1s in length and T = 80000. The code ran for 0.365 seconds, well below the time limit of 2 seconds. I submitted and then was pretty surprised to see that I barely managed to qualify for Round 1, with only 250 points.
I say the case I used is the worst case because the number of instructions executed depends only on the branching condition idx >= 0 in the inner for loop. When it is true, one more loop of complexity O(n) is executed; otherwise only a single O(1) operation is. And as you can see, the smaller the elements of length, the more often this condition is true.
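To put a number on that reasoning, here is a small counting sketch (it is not part of my solution; countInnerLoopOps is just an illustrative helper) that tallies how many elementary operations the double loop performs for a given test:

#include <cstdio>
#include <vector>
using namespace std;

// Counts the elementary operations of the dp loops above:
// n per inner k-loop when idx >= 0, and 1 for the else branch.
long long countInnerLoopOps(const vector<int>& length, int T) {
    long long ops = 0;
    int n = length.size();
    for(int i = 1; i <= T; i++)
        for(int j = 0; j < n; j++)
            if(i - length[j] >= 0)
                ops += n;
            else
                ops += 1;
    return ops;
}

int main() {
    vector<int> allOnes(50, 1);
    // Prints 200000000, i.e. the full 2 * 10^8 for the all-1s test.
    printf("%lld\n", countInnerLoopOps(allOnes, 80000));
    return 0;
}

For the all-1s test it prints 200000000, while the failing test below executes strictly fewer operations, which is exactly why the measured times surprised me.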
Despite this reasoning, my solution fails on the following case:
length = {1, 1, 1, 1, 3, 3, 3, 3, 1, 3, 3, 2, 3, 2, 3, 3, 1, 2, 3, 1, 2, 3, 2, 1, 3, 1, 1, 1, 2, 3, 2, 3, 2, 2, 1, 3, 1, 1, 3, 1, 3, 1, 3, 2, 3, 1, 1, 3, 2, 76393} T= 77297.
For this case my program runs for 5.204 seconds on my machine (my computer is significantly slower than Topcoder's, but I am going to show runtimes on my machine, since it is the ratio that matters here), whereas
length = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1} T = 80000 runs for 0.750 seconds on my machine.
My first assumption was that the reason for this unexpected ratio of runtimes (given that the first case should execute noticeably fewer processor instructions) was that the processor somehow caches similar calculations: in my example the calculations are symmetric with respect to all the elements of length, and a really clever processor could use this to avoid repeating the same sequence of instructions. So I tried composing another example, this time with different values in the length array:
length = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 77943} T=80000 runs for 0.813000 seconds on my machine.
After this example I can no longer explain these time measurements: my second example seems to require more processor instructions than the test that failed me, and it does not allow the caching I thought was happening in the first example. I have not been able to pinpoint the cause of this behavior, but I am quite sure it has something to do either with processor caches or with pipelines. I am very curious how these experiments behave on different chipsets, so feel free to comment here.
Also, if anyone more knowledgeable in hardware than me can explain this behavior, it would be appreciated.
Until then, there is a note I should make for myself: when estimating algorithm complexity, do not underestimate processor optimizations. Sometimes they significantly decrease or increase the amortized speed of specific examples.
PS: I wonder whether the Topcoder guys specifically thought of such "processor-breaking" examples?
Just add a line
if( dp[idx][k] > 1e-300 )
before
dp[i][k] += mul * dp[idx][k];
and the solution runs in 867 ms.
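In the context of the code above, the patched inner loop would look something like this (just a sketch of the described one-line change):

if(idx >= 0) {
    for(int k = 0; k < n; k++)
        if(dp[idx][k] > 1e-300) // skip contributions far too small to affect the answer
            dp[i][k] += mul * dp[idx][k];
}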
And the reason is subnormal numbers below 1e-306, which make the program run about 50-100 times slower.
...
1.0E-298: 121
1.0E-299: 122
1.0E-300: 121
1.0E-301: 122
1.0E-302: 121
1.0E-303: 121
1.0E-304: 6374
1.0E-305: 6373
1.0E-306: 6377
1.0E-307: 6375
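For reference, a table of this shape can be reproduced with a tiny benchmark along the following lines. This is only a sketch of the idea, not the original measurement code (the iteration count, the 0.02 factor and the timing method are my own choices); on many x86 CPUs the timings jump sharply once the products drop below DBL_MIN (about 2.2e-308) and become subnormal:

#include <cstdio>
#include <ctime>

int main() {
    for(int e = 298; e <= 310; e++) {
        // Build x roughly equal to 1.0E-e; volatile prevents the compiler
        // from hoisting the multiplication out of the timed loop below.
        volatile double x = 1.0;
        for(int i = 0; i < e; i++) x = x * 0.1;

        volatile double sink = 0.0;
        clock_t start = clock();
        for(int i = 0; i < 10000000; i++)
            sink = x * 0.02; // mul = 1/50 as in the solution; subnormal products hit a slow path on many CPUs
        long ms = (long)((clock() - start) * 1000 / CLOCKS_PER_SEC);
        printf("1.0E-%d: %ld ms\n", e, ms);
    }
    return 0;
}

With the threshold check from the fix, the multiplications in the solution never see values in that range, which is why the patched version stays on the fast path.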