Блог пользователя -14

Автор -14, история, 9 месяцев назад, По-английски

When I was doing the problem C. Pokémon Arena in the last Div. 1 round, I submitted my solution and got a TLE on pretest 20.

I was not suspecting anything but my code, I thought there may be some degenerate issue, undefined behavior or constant issue, but nothing really found. After some unsuccessful attempts, I was able to pass the pretest using fast IO. It's then I found something weird: it only takes a few hundred milliseconds, which is contradictory to my intuition: cin with sync_with_stdio(false) is fairly fast and it should not take up so much more (at least 10x) time.

After the contest, I submitted exactly the same code with different language. You know what?

It's not only me, and some other participants also encountered such issue. For example:

Here's snippet for the key code:

int n, m;
cin >> n >> m;
vector a(n, vector (m, 0));
vector c(n, 0);
for (int i = 0; i < n; i++) cin >> c[i];
for (int i = 0; i < n; i++) {
	for (int j = 0; j < m; j++) {
		cin >> a[i][j];
	}
}

But there are also some successful cin submissions using GNU C++20 (64). After investigation by (including but not limited to) Sugar_fan, Boboge, -skyline-, the key components of TLE are:

  • The language must be GNU C++20 (64).
  • vector must be of int. If the elements are long long, it passed. 249050206
  • The definition of 2D vector must be before reading c. If swap these two lines, it passed. 249050425

So here's the thing. It could not even be simply interpreted as some branch mispredictions or cache misses, it seems something is completely broken, and we still don't understand what is actually wrong.

  • Проголосовать: нравится
  • +392
  • Проголосовать: не нравится

»
9 месяцев назад, # |
  Проголосовать: нравится +3 Проголосовать: не нравится

Auto comment: topic has been updated by -14 (previous revision, new revision, compare).

»
9 месяцев назад, # |
  Проголосовать: нравится -8 Проголосовать: не нравится

GNU C++17(20) also fails. 249014238 Maybe it's related with 64bit platform. int is 32bit. There may be some bugs in the complier dealing with this case, I think.

»
9 месяцев назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

It seems that the issue is related to the Windows system or codeforces judger. I can't reveal the problem on my Linux(Ubuntu) computer.

»
9 месяцев назад, # |
  Проголосовать: нравится +2 Проголосовать: не нравится

Probably, it is caught by cache misses. You can swap indices by which you access data in the a array (vector a(m, vector (n, 0)); -> vector a(n, vector (m, 0));, a[x][i] -> a[i][x] and so on) and check if it gets passed.

  • »
    »
    9 месяцев назад, # ^ |
      Проголосовать: нравится +16 Проголосовать: не нравится

    It's not the issue. My reason is: replacing cin into scanf leads to AC. 249050013

    • »
      »
      »
      9 месяцев назад, # ^ |
        Проголосовать: нравится +3 Проголосовать: не нравится
      • »
        »
        »
        »
        9 месяцев назад, # ^ |
          Проголосовать: нравится +11 Проголосовать: не нравится

        Yes, it works. But why I said it's not the issue:

        1. Keeping the same memory structure, replacing cin into scanf passed. So swapping indices is not necessary.
        2. There are many subtle changes to let it work: even swapping two lines of code, or changing int to long long (which is bad for cache) as mentioned in the main text. I am not surprised if there are many more changes to pass, but that's not the real issue which should be supported by ablation experiments.
        • »
          »
          »
          »
          »
          9 месяцев назад, # ^ |
            Проголосовать: нравится +3 Проголосовать: не нравится

          249055299 should give some ideas ;)

          • »
            »
            »
            »
            »
            »
            9 месяцев назад, # ^ |
              Проголосовать: нравится 0 Проголосовать: не нравится

            That's interesting. I had never thought of the vector capacity is not a power of two when constructing it with size. This may contribute to the slowdown in an indirect (but still mysterious) way. Incorporating the long long observation, the bug trigger must include a weird specific vector size (in this case, $$$40000 \times 10$$$).

            But to my understanding of cache, using power of two as the low dimension size leads to more cache conflicts. So currently my main suspection is: there may be some allocator issues in STL, so that cin and 2D vector slowed down each other.

»
9 месяцев назад, # |
Rev. 2   Проголосовать: нравится 0 Проголосовать: не нравится

Maybe there is difference between std::ios_base::sync_with_stdio(false); and std::ios::sync_with_stdio(false);? I used first one and had no issue on 64-bit compiler: https://codeforces.net/contest/1936/submission/248927650 You used second one and caught TLE

  • »
    »
    9 месяцев назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    It seems not the issue. My code still got TLE. 249055450

  • »
    »
    9 месяцев назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    249054815 without sync_with_stdio(false) at all.

  • »
    »
    9 месяцев назад, # ^ |
      Проголосовать: нравится +39 Проголосовать: не нравится

    All of the sync_with_stdio functions are calling the same exact function. sync_with_stdio is a static member function inherited from the ios_base class. I presonally prefer to call it using cin.sync_with_stdio, but you could call it in all kind of ways. It doesn't matter how you do it.

»
9 месяцев назад, # |
  Проголосовать: нравится +9 Проголосовать: не нравится

C++ is a weird programming language where bugs can be elusive, only appearing seemingly at random. Small seemingly unimportant changes to the code can make them appear or dissapear.

  • vector must be of int. If the elements are long long, it passed. 249050206
  • The definition of 2D vector must be before reading c. If swap these two lines, it passed. 249050425

Just from reading this, it sounds like you likely have UB (undefined behaviour), possibly due to overflow.

Claiming that fast IO is the culprit is far too hasty of a conclusion. I would need to be 100% sure that no UB is present before drawing any conclusion like that.

  • »
    »
    9 месяцев назад, # ^ |
      Проголосовать: нравится +21 Проголосовать: не нравится

    I am 99.9% sure it's not my code that UB.

    See 249078517. On test case 20, it should be equivalent to something like this, which I don't think an UB can exist.

    #include <bits/stdc++.h>
    using namespace std;
    
    void work () {
    	int n, m;
    	cin >> n >> m;
    	vector <vector <int>> a(n, vector <int> (m, 0));
    	vector <int> c(n, 0);
    	for (int i = 0; i < n; i++) cin >> c[i];
    	for (int i = 0; i < n; i++) {
    		for (int j = 0; j < m; j++) {
    			cin >> a[i][j];
    		}
    	}
    	if (n == 40000 && m == 10) {
    	    return;
    	}
    	// all following skipped
    }
    
    int main() {
    	ios::sync_with_stdio(false); cin.tie(0);
    	int T = 1;
    	cin >> T;
    	for (int ca = 1; ca <= T; ca ++)
    		work();
    }
    
    • »
      »
      »
      9 месяцев назад, # ^ |
        Проголосовать: нравится +31 Проголосовать: не нравится

      I did some digging. I took multiple different C++ submissions from different users and modified their IO. Then by submitting to codeforces GYM I'm able to test the problem with a higher TL.

      First thing, I've only been able to trigger the slowness bug using cin, and not using scanf. I've also noticed that it doesn't not seem matter if you use cin.tie and/or sync_with_stdio. In all cases I've tested, the slowness bug triggered independently of this.

      One way to trigger the slowness bug is by reading input like this (It takes 4.1s on both TC 20 and 21):

      int n, m;
      cin >> n >> m;
      vector<int> c(n);
      vector<vector<int>> a(n, vector<int>(m));
      for (int i = 0; i < n; ++i)
          cin >> c[i];
      for (int i = 0; i < n; ++i)
          for (int j = 0; j < m; ++j)
              cin >> a[i][j];
      

      The following also takes 4.1s (on both TC 20 and 21):

      int n, m;
      cin >> n >> m;
      vector<vector<int>> a(n, vector<int>(m));
      vector<int> c(n);
      for (int i = 0; i < n; ++i)
          cin >> c[i];
      for (int i = 0; i < n; ++i)
          for (int j = 0; j < m; ++j)
              cin >> a[i][j];
      

      While reading input like this gives AC (typically < 0.2s):

      int n, m;
      cin >> n >> m;
      vector<int> c(n);
      for (int i = 0; i < n; ++i)
          cin >> c[i];
      vector<vector<int>> a(n, vector<int>(m));
      for (int i = 0; i < n; ++i)
          for (int j = 0; j < m; ++j)
              cin >> a[i][j];
      

      It is completely unreasonable that reading $$$4 \cdot 10^5$$$ integers would take 4 s. I have no clue what is actually going wrong here.

    • »
      »
      »
      9 месяцев назад, # ^ |
        Проголосовать: нравится +19 Проголосовать: не нравится

      Here is an update on this. I tested doing

      cin >> n >> m;
      if (n == 40000 && m == 10)
        work2();
      else
        work(n,m);
      

      where

      void work2 () {
          const int n = 40000, m = 10;
          vector<vector<int>> a(n, vector<int>(m));
          for (int i = 0; i < n; i++) {
              for (int j = 0; j < m; j++) {
                  cin >> a[0][0];
              }
          }
      }
      

      and it still TLEs on TC20 249166970. This is absurd. It gets TLE simply by reading $$$4 \cdot 10^5$$$ values into a[0][0].

»
9 месяцев назад, # |
  Проголосовать: нравится +6 Проголосовать: не нравится

To add on, I was easily able to reproduce this myself by swapping two lines in my submission: 249017883.

»
9 месяцев назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

Hmm, maybe there is a bug in the compiler or libstdc++: 249085796.

»
9 месяцев назад, # |
Rev. 10   Проголосовать: нравится +21 Проголосовать: не нравится

hey

I've set up test20 as a separate problem, you can add it by using 50437e8d91037c0fdbe5c65a in the mashup, or you can download here

I hope this helps in any way.

After deleting all calculations, it uses 4000ms for inputs


Observed some strange content, here's test code

code

It seems like it's just uniformly slow for no reason at all, There isn't any suspicious lag.

result graph
»
9 месяцев назад, # |
Rev. 2   Проголосовать: нравится +141 Проголосовать: не нравится

I've figured out how to trigger this slowness bug on basically any problem on CF!

  1. Take any problem with a decent amount of IO on cf (like 2e5 ints).
  2. Take a random AC C++ submission that uses cin.
  3. Add the line vector<vector<int>> TLE(40000, vector<int>(7)); somewhere in global space.
  4. ???
  5. TLE

For example take Tourist's solution 248948695 to problem 1936-D - Bitwise Paradox. With the vector of death added to the code 249181295, it gets TLE on TC5 (taking 5s). While without the vector of death, Tourist's submission takes 155 ms on TC5.

  • »
    »
    9 месяцев назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    When you compile the code locally asm and submit it directly (a cheat trick known to many), it will still result in a TLE, but locally there is no problem.

    My guess is that there is a problem with memory allocation in codeforces 64 bit.

  • »
    »
    9 месяцев назад, # ^ |
      Проголосовать: нравится +47 Проголосовать: не нравится

    Here is a stand alone example with the slowdown (credit to kostia244 for showing me this)

    #include<bits/stdc++.h>
    using namespace std;
    vector<vector<int>> TLE(40000, vector<int>(7));
    int main() {
        int n = 17;
        string s;
        for(int i = 0; i < n; i++) s += to_string(i) + " ";
        
        for (int j = 0; j < 30000; ++j) {
            istringstream ss(s);
        
            int x = 0;
            while (ss >> x);
        }
    }
    

    The vector of death makes this run about 100 times slower.

    • »
      »
      »
      9 месяцев назад, # ^ |
        Проголосовать: нравится +71 Проголосовать: не нравится

      Maybe, it's time to invite MikeMirzayanov to have a look at the problem. It's almost not reproducible on any other environment, so only administrators can further investigate the issue.

    • »
      »
      »
      9 месяцев назад, # ^ |
      Rev. 3   Проголосовать: нравится 0 Проголосовать: не нравится

      original comment: replacing " " with to_string(' ') take 290ms as pointed out below this is because ' ' is 32 lol

      Reproduced it for 64 bit integers using size 4 for inner vector

      #include<bits/stdc++.h>
      using namespace std;
      vector<vector<long long>> TLE(40000, vector<long long>(4));
      int main() {
          int n = 17;
          string s;
          for(int i = 0; i < n; i++) s += to_string(i) + ' ';
          
          for (int j = 0; j < 30000; ++j) {
              istringstream ss(s);
          
              int x = 0;
              while (ss >> x);
          }
      }
      
      • »
        »
        »
        »
        9 месяцев назад, # ^ |
          Проголосовать: нравится +8 Проголосовать: не нравится

        AFAIU, it was intended to generate a string containing n integers to pass it to istringstream. You broke this because to_string(' ') is just to_string(32).

»
9 месяцев назад, # |
  Проголосовать: нравится +17 Проголосовать: не нравится

I'm having strong flashbacks by comparing this blog and https://codeforces.net/blog/entry/108168 ...

It feels kinda frightening when you realize that your correct solution can get TLE due to some absolutely unexpected stuff D:

»
9 месяцев назад, # |
  Проголосовать: нравится -8 Проголосовать: не нравится

what about basic_string or valarray?

»
9 месяцев назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

How do you install GNU compiler on Windows? I have the MinGW compiler from CodeBlocks (added to PATH) and I can't even compile with -c++17 using it.

  • »
    »
    9 месяцев назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    I installed mingw64 compiler from this (don't follow any link, download from this page).

    g++ -v

    Now my g++ version is 13.2.0 and it supports c++20. I took this test and compiled with g++ main.cpp -std=c++20 -o main.exe. It executes in about 600 ms on both c++20 and c++17. I don't see strange behaviour.

    Did someone experienced this locally?

    • »
      »
      »
      9 месяцев назад, # ^ |
        Проголосовать: нравится +3 Проголосовать: не нравится

      I installed specifically version 11.2.0, used compilation flags from here:

      g++ -Wall -Wextra -Wconversion -static -DONLINE_JUDGE -Wl,--stack=268435456 -O2 -std=c++20 %1

      and used this program from one of the comments above.

      Code

      Still, no strange behaviour. It executes in about 850 ms.

      • »
        »
        »
        »
        9 месяцев назад, # ^ |
          Проголосовать: нравится +22 Проголосовать: не нравится

        Wait, I just noticed strange thing on Codeforces.

        I just ran the exact this code in custom test. It shows time 4851 ms always and actually, the time between the moment I press button Run and log appears is about 1.5 seconds.

        Seems the problem is in judge system, not the compiled program.

        • »
          »
          »
          »
          »
          9 месяцев назад, # ^ |
            Проголосовать: нравится +28 Проголосовать: не нравится

          I think that is simply due to codeforces caching the result of the invocation. You need to change the code/input slightly to make codeforces actually run your code.

          • »
            »
            »
            »
            »
            »
            9 месяцев назад, # ^ |
              Проголосовать: нравится 0 Проголосовать: не нравится

            Indeed.

            However, the question remains: could someone experience this locally, or it is the issue of only mysterious codeforces compiler.

»
9 месяцев назад, # |
  Проголосовать: нравится +18 Проголосовать: не нравится

A while back i faced a similar issue: https://codeforces.net/blog/entry/99038

Maybe they are related

»
9 месяцев назад, # |
  Проголосовать: нравится +13 Проголосовать: не нравится

I have new conjecture. Look at this submission: https://codeforces.net/contest/1936/submission/249331767 It is same as the one that gets TLE, but with fast allocator. So I suppose that each invocation std::cin allocates and free memory which is slow due to "vector of death" allocating many small chunks of memory. And scanf seems to not allocate or free memory.