minimario's blog

By minimario, 7 years ago, In English

Hi all,

I was solving this problem: Link. It's fairly easy, so if you don't really want to solve it, I'll go ahead and write the dp recurrence here:

dp[i][k] = max(dp[j][k - 1] + p[j] * (p[i] - p[j])) over all j < i

A pretty obvious Convex Hull Trick (CHT) problem. But I'm getting TLE on the last test case. I opened some AC submissions and didn't see much difference from my code, so I'm wondering what makes mine so slow and theirs so fast!
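
For reference, here is roughly how the recurrence maps onto CHT: each j contributes a line y = p[j] * x + (dp[j][k - 1] - p[j]^2), and dp[i][k] is the maximum of those lines at x = p[i]. Below is a minimal sketch of that formulation (it is not either of the two submissions linked below, and it assumes p[] is non-decreasing, e.g. a prefix-sum array, so both the inserted slopes and the query points are monotone and a two-pointer hull suffices):

    #include <bits/stdc++.h>
    using namespace std;
    typedef long long ll;

    // Monotone CHT for a maximum: slopes are inserted in non-decreasing order
    // and queries arrive in non-decreasing order.
    struct Hull {
        vector<ll> M, B;  // lines y = M[t] * x + B[t]
        int ptr = 0;      // index of the current best line; only ever moves forward

        // true if the middle line (m2, b2) never appears on the upper envelope
        static bool bad(ll m1, ll b1, ll m2, ll b2, ll m3, ll b3) {
            return (__int128)(b3 - b1) * (m2 - m1) >= (__int128)(b2 - b1) * (m3 - m1);
        }
        void add(ll m, ll b) {  // m must be >= every slope added so far
            while (M.size() >= 2 &&
                   bad(M[M.size() - 2], B[M.size() - 2], M.back(), B.back(), m, b)) {
                M.pop_back();
                B.pop_back();
            }
            M.push_back(m);
            B.push_back(b);
            ptr = min(ptr, (int)M.size() - 1);
        }
        ll query(ll x) {  // x must be >= every x queried so far
            while (ptr + 1 < (int)M.size() &&
                   M[ptr + 1] * x + B[ptr + 1] >= M[ptr] * x + B[ptr]) ptr++;
            return M[ptr] * x + B[ptr];
        }
    };

    // One layer of the dp: given dpPrev[j] = dp[j][k - 1], fill dpCur[i] = dp[i][k].
    // Exact boundary handling (which j are legal for a given i, base values) depends
    // on the problem, so it is only hinted at; also assumes p[i]^2 fits in a long long.
    void nextLayer(const vector<ll>& p, const vector<ll>& dpPrev, vector<ll>& dpCur) {
        int n = (int)p.size();
        Hull hull;
        for (int i = 0; i < n; i++) {
            if (i > 0) dpCur[i] = hull.query(p[i]);   // best dp[j][k - 1] + p[j] * (p[i] - p[j]) over j < i
            hull.add(p[i], dpPrev[i] - p[i] * p[i]);  // make i available as a future j
        }
    }

Each layer is O(n) amortized this way, so the whole dp is O(n * k).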

436 ms code

My submission

If anyone has some insights, please comment (or PM me) with details!

(P.S.: If anyone has a fairly fast implementation of CHT and would like to challenge the 436 ms, go ahead and submit it :))

Thanks so much,

-minimario


»
7 years ago, # |

It seems you have fallen victim to poor locality of reference (wiki).

I swapped the two indices of the "last" array in your code, and now it passes in 513 ms.
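
For anyone else wondering what "swapping the indices" means concretely: C++ arrays are row-major, so with last[j][k] the inner loop over j jumps a whole row's worth of bytes between accesses, while with last[k][j] it walks consecutive ints. A toy sketch with made-up dimensions (not the actual submission code):

    #include <cstdio>

    // Made-up sizes, just to show the layout difference; not the real dimensions.
    const int MAXN = 100005;
    const int MAXK = 64;

    static int last_jk[MAXN][MAXK];  // original layout, accessed as last_jk[j][k]
    static int last_kj[MAXK][MAXN];  // swapped layout, accessed as last_kj[k][j]

    int main() {
        long long s1 = 0, s2 = 0;
        for (int k = 0; k < MAXK; k++) {
            for (int j = 0; j < MAXN; j++) {
                s1 += last_jk[j][k];  // stride of MAXK * 4 = 256 bytes: a new cache line every access
                s2 += last_kj[k][j];  // stride of 4 bytes: 16 ints share each 64-byte cache line
            }
        }
        printf("%lld %lld\n", s1, s2);  // both sums are zero; only the access pattern differs
        return 0;
    }

Same loops, same amount of work, but the second access pattern is the one cache lines and the hardware prefetcher are built for.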

  • »
    »
    7 years ago, # ^ |

    orz, thanks so much! Needless to say, you have my upvote :)

    So from my understanding: since I am accessing last[k][1], last[k][2], ..., last[k][k], the computer "mindlessly" expects me to access last[k][stuff] every time, whereas accessing last[1][k], last[2][k], ... causes a lot of cache-miss penalties.
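
    A quick way to see that penalty is to time the two traversal orders on an array much bigger than the cache; a toy benchmark like the one below (arbitrary sizes, nothing to do with the contest code) usually shows the strided order running several times slower on a desktop CPU:

        #include <chrono>
        #include <cstdio>

        // Arbitrary illustrative sizes: 4096 * 4096 ints = 64 MB, far bigger than any cache.
        const int R = 4096, C = 4096;
        static int a[R][C];

        template <class F>
        static double secondsTaken(F f) {
            auto t0 = std::chrono::steady_clock::now();
            f();
            auto t1 = std::chrono::steady_clock::now();
            return std::chrono::duration<double>(t1 - t0).count();
        }

        int main() {
            volatile long long sink = 0;

            // Row-wise: the inner index is the last dimension, so consecutive accesses
            // are adjacent in memory and every byte of each loaded cache line gets used.
            double rowT = secondsTaken([&] {
                long long s = 0;
                for (int i = 0; i < R; i++)
                    for (int j = 0; j < C; j++) s += a[i][j];
                sink = s;
            });

            // Column-wise: consecutive accesses are C * 4 = 16 KB apart, so each one
            // lands on a different cache line (and usually a different page).
            double colT = secondsTaken([&] {
                long long s = 0;
                for (int j = 0; j < C; j++)
                    for (int i = 0; i < R; i++) s += a[i][j];
                sink = s;
            });

            printf("row-wise: %.3f s, column-wise: %.3f s\n", rowT, colT);
            return 0;
        }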

    So then, if I'm right, why did so little change when I tried the same thing for the "dp" array? It is accessed as dp[1][k%2], dp[2][k%2], ..., and based on what I wrote above, I would expect dp[k%2][1], dp[k%2][2], ... to be the faster option.

    • »
      »
      »
      7 years ago, # ^ |

      Perhaps "mindlessly" is not the right word.

      My (admittedly limited) understanding is that the processor loads a contiguous chunk of stuff from RAM into the cache (which is uber fast); so when you access last[k][2] right after last[k][1], it is most likely already in cache, and the access is fast.

      But your question is still valid. To resolve it, note that sizeof(dp) = 2 * 100005 * 4 bytes ≈ 800 KB: that is bigger than the L1 cache (the super-fast kind, on the order of 32 KB on modern processors), but it is in the right order of magnitude for the total cache size (L2/L3), so my guess is that the whole dp array stays cached and hardly misses at all.
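
      If you want to double-check that arithmetic, a few lines do it (assuming 4-byte ints and the 100005 x 2 shape mentioned above):

          #include <cstdio>

          static int dp[100005][2];  // the shape discussed above; 4-byte elements assumed

          int main() {
              // 100005 * 2 * 4 = 800040 bytes, i.e. about 781 KB: larger than a ~32 KB L1,
              // but well within reach of a typical L2/L3.
              printf("sizeof(dp) = %zu bytes = %zu KB\n", sizeof(dp), sizeof(dp) / 1024);
              return 0;
          }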