Bizarre slowdown when using cin on a 64-bit compiler

#	User	Rating
1	tourist	3856
2	jiangly	3747
3	orzdevinwang	3706
4	jqdai0815	3682
5	ksun48	3591
6	gamegame	3477
7	Benq	3468
8	Radewoosh	3462
9	ecnerwala	3451
10	heuristica	3431

#	User	Contrib.
1	cry	167
2	-is-this-fft-	162
3	Dominater069	160
4	Um_nik	158
5	atcoder_official	156
6	Qingyu	155
7	djm03178	151
7	adamant	151
9	luogu_official	150
10	awoo	146

Hello codeforces! I've come to ask help in a time of great mystery.

I was helping to debug a problem where the code would pass the 2s TL when on a 64-bit compiler but would get confortably under 1s on a 32-bit one when some really bizarre things started to happen.

First, pinning down where the problematic parts were, it turns out this was the thing consuming practically the whole time limit:

set<ll> s;
for (int i(0); i < k; ++i) {
    int pp;
    cin >> pp;
    s.insert(pp);
}

Im not joking. Also k <= 3e5 btw.

It really shouldnt be a big deal, and normally that part should take less than 140ms, but both on C++20 (64) and C++17 (64) make it take from 1900ms to 2000ms.

I made a submission that outputs the time taken to do the input, the inserting, and the ammount of elements in the set; and with that even more weird stuff appears. Look at the result:

142730404

Output

0:
Input Time: 0ms
Insert Time: 0ms
Set Size: 0

50000:
Input Time: 36ms
Insert Time: 38ms
Set Size: 50000

100000:
Input Time: 74ms
Insert Time: 83ms
Set Size: 99904

150000:
Input Time: 566ms
Insert Time: 142ms
Set Size: 99904

200000:
Input Time: 1059ms
Insert Time: 200ms
Set Size: 99904

250000:
Input Time: 1543ms
Insert Time: 244ms
Set Size: 99904

In this test case until the 90000-ish iteration there are distinct elements and after that the input time starts to raise aggressively, while the insert time grows at a basically unaltered pace. Maybe they start repeating a relatively big number, but even then, it goes from 30ms to read 50000 guys to 500ms. It's insane.

If you want, take a look at the version of the output when using 32-bit for comparison (the numbers will actually be a bit inflated because you need to get the timings with std::chrono):

142730436

Output

0:
Input Time: 0ms
Insert Time: 0ms
Set Size: 0

50000:
Input Time: 41ms
Insert Time: 47ms
Set Size: 50000

100000:
Input Time: 82ms
Insert Time: 101ms
Set Size: 99904

150000:
Input Time: 123ms
Insert Time: 151ms
Set Size: 99904

200000:
Input Time: 164ms
Insert Time: 198ms
Set Size: 99904

250000:
Input Time: 204ms
Insert Time: 240ms
Set Size: 99904

Final Input Time: 245ms
Final Insert Time: 285ms
Final Set Size: 99904

Ending thoughts

The problematic test case is case 6, however case 5 has the same input size and the set ends up with more items (see 142732612). Why is the problem only with test 6???
If instead of reading a variable and instantly inserting it into the set, you first input an array and then insert it into the set, no tragedy happens (142732784).
How can the input go so wrong? A 20x slowdown out of nowhere should not be happening.
Is there a way to get the raw test data from the authors or something similar to see what actually was on test 6 so we can try to replicate it?
Thanks qwexd for doing nothing.

Comments (9)

Write comment?

Everule

3 years ago, # |

+37

Bumping post. I personally don't see any plausible reason why cin would randomly slow down. My best hypothesis would be that undefined behaviour likely edited some internal variable in cin, which could cause the slowdown, but that would imply there is a bug in the implementation of std::set. Just to rule out some other factors I would recommend checking the number of digits read in each block and printing that as well.

→ Reply

MrDindows

3 years ago, # ^ |

+35

It seems that problem happens when set reaches size of exactly 99904. The numbers in set don't matter. Check this submission: https://codeforces.net/contest/796/submission/142771971

And I have no idea why this happens.

defnotmee

+43

Ok i made the debug trigger on test 5 instead with your code and now it TLEs like when on test 6, so not only do the numbers on the set dont matter but also the numbers on the input 142784418

I also tried changing it to other size to other sizes like 100000, 99904+1, 99904-1, 99904+2, 99904/2, 99904/7, 99904*2, but none of them TLE like 99904

← Rev. 2 →

+34

Moreover, I create 2 sets and insert each number into random set, and the issue is reproducible only if total size of 2 sets is 99904. https://codeforces.net/contest/796/submission/142787913

Also it doesn't seem to be related to memory. Because we can add some other structures, change size of the arrays etc, nothing changes.

GILGAMESH

← Rev. 4 →

+36

Honestly, I can’t think of any reason outside of cache locality, but it is not super convincing to me either.

Observations:

Swapping set for unordered_set actually makes C++x86 2x slower: 142768060 142767613. This seems to all be accounted for in the timing for cin >>.
Using custom buffer + fread alleviates the time diff: 142776426 142776447.
Note that the C++x64 compilers run on Windows while the x86 compilers do not; there may be system differences.

Problems:

C++x64 seems to be slower for some inputs.
Using unordered_set reverses the time diff.
Slowness is always accounted for in the cin >> operation.
Using array intermediary solves the problem (1).
Using fread with custom cache solves problem (1).

Some possible answers:

C++x64 runs on a system with smaller cache than C++x86 and misses far more for certain inputs.
unordered_set prefers a larger cache than set (which might make sense considering it is a hash table). But this cache preference is larger than both system’s cache sizes. So when both C++x64 and C++x86 see lots of cache misses, C++x64 starts performing better…
As we insert more into the set or unordered_set, eventually the cache is too small, and we always kick the cin buffer out of the cache when doing the insert. So timing cin >> accounts for loading into the cache at that point and looks much slower when the set/unordered_set reaches a large size. (3.1) But why do we not get the same slowdown for .insert when we reach cache limits? Likely because .insert nearly always cache misses as it traverses different paths in the underlying BBST regardless of the number of nodes. So there is no slowdown as it gets large, since it is always slow.
Maybe the array is much smaller than the cin buffer and easier to fit in cache. But I think streambuf is fairly light on memory… maybe cin has a large putback buffer too. (4.1) This is a stretch, but the compiler could have optimized the code here to sort the array and then construct a BBST out of a sorted array, which is much faster than constructing a BBST from random order. godbolt doesn’t seem to support this hypothesis though, so take it lightly.
Honestly the reason for this is probably similar to the reason why using an array intermediary solves the problem.

+25

Thanks for your insight! It seems that MrDindows's observation also supports your idea of super weird cache misses being the problem, since the TLE happens only for specific set size/sizes.

I also tried priority_queue (142783105) and (funnily) the pbds set (142782176) and they also both pass.

A couple of points id like to address:

C++x64 runs on a system with smaller cache than C++x86 and misses far more for certain inputs.

The particular input doesnt seem like the problem. Forcing the size of the set to be 99904 also makes it TLE on another test case: 142784418

This is a stretch, but the compiler could have optimized the code here to sort the array and then construct a BBST out of a sorted array, which is much faster than constructing a BBST from random order. godbolt doesn’t seem to support this hypothesis though, so take it lightly.

I think it's possible the compiler tries to optimize something by ordering things differently (maybe reading like 50 numbers at a time and then inputting those 50 numbers into the set, idk) and the fact that trying to isolate the timing with chrono like in the blog makes the total time be really inflated (142730436 shows it took around 500ms to input and insert, while 142729216 says 137ms). Even if this is true i dont think its the sole reason tho

Kyou_mo_kawaii

+20

In general the compiler is allowed to reorder those timing calls so microbenchmarking is pretty hard: https://stackoverflow.com/questions/37786547/enforcing-statement-order-in-c (there's a nice DoNotOptimize helper in the first answer that can help with this)

I wonder if this test case works on other problems where the natural solution is to read the input into a set. This might start a new era of std::set being unusable... (like how vanilla std::unordered_map is unusable due to hacking hashes today)

beep_boop

Sorry for bumping this post. But, has anyone found a way to reproduce this time limit exceeded on other problems?

I tried to read input 99904 times this way in other (permutation) problems, but got no time out. 157125980 157127092

In OP's solution, I changed cin to scanf and it also worked somehow. 157000430

eggag32

18 months ago, # |

+112

I just ran into the same issue. My solution to today's F got TLE49. However, if I read and store the array before adding indices to corresponding vectors, it passes in about half of the time limit. In fact, when I ran the first code in a mashup, it took 14s! In 32-bit compilers, both run in time.

defnotmee's blog

Ending thoughts