imposter_syndrome's blog

By imposter_syndrome, history, 6 years ago, In English

Hi Everyone, i have been seeing this for some days now on codeforces. In yesterday's educational round, I submitted problem D with https://codeforces.net/contest/1107/submission/49013323 solution. Later I realized that it takes minimum 5040*5040*60 operations in worst case. After contest, i tried hacking my solution but it passed successfully. Also I think testcase 20 is having same worst case input (all 0's and last rows last element if say 1), but it passes there in 732 ms. In my local env it takes more than 4s for this to pass.

Also in CF#534, I tried hacking this O(n^2) solution https://codeforces.net/contest/1104/submission/48743647, which was doing more than 10^9 operation on worst case (ababa....aa..bababababa string). But it passed in 640 ms. From my understanding it takes 1s for approx 10^8 operations.

Has codeforces upgraded there testing servers or I am missing something here?

»
6 years ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

I don't know about your code( didn't even opened it ) but I used bitset, which divides the operations by 32. And btw in worse case it takes 5200*5200*48( not 60 ).

  • »
    »
    6 years ago, # ^ |
      Vote: I like it +16 Vote: I do not like it

    5040 has 60 divisors. It takes 1524096000 number of operations (+ some other computations) :) While 5200 has 30 divisors and it takes 811200000 operations. Worst case is 5040 from my understanding.

    • »
      »
      »
      6 years ago, # ^ |
      Rev. 2   Vote: I like it -70 Vote: I do not like it

      OK. Anyway, you don't need to post a blog for such things!!!

      • »
        »
        »
        »
        6 years ago, # ^ |
          Vote: I like it +26 Vote: I do not like it

        Ohh. sorry didn't know exact way to ask this. I thought other people may also be seeing this and it would be good to ask if something is missing. Sorry for all the inconveniences it caused to you

        • »
          »
          »
          »
          »
          6 years ago, # ^ |
            Vote: I like it -53 Vote: I do not like it

          I don't care, but as I know they'll downvote, maybe a comment somewhere on tutorial or announcement could be enough...

          • »
            »
            »
            »
            »
            »
            6 years ago, # ^ |
            Rev. 2   Vote: I like it -27 Vote: I do not like it

            Seems they just downvote me... Why always me?? What have I done to you?? OK downvote, I love downvotes!!!

»
6 years ago, # |
  Vote: I like it +5 Vote: I do not like it
  1. Also I think testcase 20 is having same worst case input (all 0's and last rows last element if say 1), but it passes there in 732 ms. You are using fast input, without it : 49042714 (732ms -> 1528ms)
  2. In my local env it takes more than 4s for this to pass. It depend on the power of your computer
  3. From my understanding it takes 1s for approx 10^8 operations. The time limit is 2.5s so i think 10^9 is fine after all ?
  • »
    »
    6 years ago, # ^ |
      Vote: I like it +16 Vote: I do not like it

    Just to add to his answer that 108 is a heuristic bound and must in no way understood as a hard cut. Sometimes, if operation per loop iterations are heavy, 107 might be tricky to squeeze in, whereas 109 may pass if per iteration work is simple.

    Also, cache plays a big role where memory accesses are linear, one after the other and can reduce time by factors. I believe in the question above, array accesses were quite linear and operations were just simple comparison.

»
6 years ago, # |
Rev. 2   Vote: I like it +19 Vote: I do not like it

I'm also interested in such question.

What has happened with C++ STL string?

Why solutions where string length is about 300 000 characters and which erase small peace (of 2 characters) in the center of the string while it is not empty passes in 1 second on Codeforces?

Isn't there about 3*1.5*10^10 operations? Or string is not a continuous memory block now, it is like an amazing "list" with indexation of elements?

Examle of 2 different problems and their solutions that I can't hacked and even final tests can't do it too:

1) https://codeforces.net/contest/990/submission/39115847 (passed in 1 second for 300 000 string!)

2) https://codeforces.net/contest/1104/submission/48742260 (passed in 0.078 second for 100 000 string!!!)

Are you kidding me? :))

What has happened?

  • »
    »
    6 years ago, # ^ |
    Rev. 9   Vote: I like it +25 Vote: I do not like it

    You forgot to divide number of operations by sizeof(YMM-register) / sizeof(char) const for a string. So, erasing from std::string takes n / sizeof(YMM-register) * sizeof(char) = n / 256 * 8 = n / 32 operations. Estimate is n*(n+1)/2 / 32 = 300000^2 / 64 = 1.4 * 10^9 operations. Most costly operation in this case is loading / storing data to / from registers. It is 1s = 10^9 in aligned by cache line size case and 2s = 10^9 in unaligned case on codeforces with sequential access and cache prefetching.

    • »
      »
      »
      6 years ago, # ^ |
        Vote: I like it +1 Vote: I do not like it

      So, memmove() can process 256 bits in one processor clock cycle?

      • »
        »
        »
        »
        6 years ago, # ^ |
          Vote: I like it +26 Vote: I do not like it

        Yes, it is.

        As additional, you can see AVX, AVX2 and FMA in my solution for gym problem G. Underpalindromity. With n = 200000 and k = 100000 it is n * k / 2 / sizeof(YMM-register) * sizeof(int) = 200000 * 100000 / 8 = 2.5 * 10^9 operations, I think.

        If submission not available for you, this is source code

        • »
          »
          »
          »
          »
          6 years ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          OMG, problem setters, let our restrictions for string/array length be 1e6 or 1e7 to avoid passing solutions with N^2 and architecture level optimizations.

          • »
            »
            »
            »
            »
            »
            6 years ago, # ^ |
            Rev. 2   Vote: I like it +2 Vote: I do not like it

            For example, you can't increase n and k in problem G. Underpalindromity, because there are solutions with correct algorithm in O(n log(n)) with working time 0.7s-1s.

            Learning algorithms and data structures — one way, learning architecture level's optimizations — another way. When you are using array instead list and iterate over it, you already using architecture level optimizations. Both knowledge can be combined, for example, you can use sqrt-decomposition with such optimizations and sqrt-list and it will be faster, than segment tree and cartesian tree.

        • »
          »
          »
          »
          »
          6 years ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          A bit simpler and faster solution: 49057315

          • »
            »
            »
            »
            »
            »
            6 years ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            Fantastic, speed up in 2 times and gcc can vectorize even with addition to int64_t!

            If you are interested, there are another problem 101341I - Matrix God "Matrix God" with 1000 x 1000 matrix multiplication by modulo 10^9+7 and fantastic solution 45798437 by Constantine Drozdov in 374 ms