TL;DR
Currently people are discussing a slowdown bug on Codeforces that seems to happen randomly, and can cause code to run 100x slower and get TLE. More details in pajenegod's blog post.
In this article, I present a mitigation: add the following to your code.
#include <windows.h>
void *operator new(size_t size) {
if (void *ptr = HeapAlloc(GetProcessHeap(), 0, size ? size : 1)) return ptr;
throw std::bad_alloc{};
}
void operator delete(void *ptr) {
HeapFree(GetProcessHeap(), 0, ptr);
}
TL;DR, part 2
This section is for Codeforces admins including MikeMirzayanov.
I have very high confidence that some poorly implemented functions in Microsoft's C runtime is the root cause of the bug which is outside contestants' control, creating unfair disadvantages in contests. The bug has likely existed since C++17 (64-bit) first came out on Codeforces.
Every C++ submission on Codeforces (and also the entire Codeforces infrastructure) currently depends on that library, on top of the internals of Windows that very few people has any clue about. Both of these are infamous for having bad implementations in many places, and I don't think it's a good idea to rely on those for one of the online judges with the most users.
I ask Codeforces developers to seriously consider other, more reliable possibilities, for example an option for judgment on Linux machines, possibly using one of the plenty judge server implementations already available. An increased stability and robustness would hopefully make contest and judging experience on Codeforces a lot better.
What's happening?
There has been a bug on Codeforces causing random TLEs recently. Sometimes when a seemingly unrelated line of code is added, or the order of two lines are swapped, or a vector size is changed, etc., the code suddenly becomes 100x slower on certain inputs. The bug got some attention because of its unexplained behavior, and the fact that many submissions (include tourist's) that should have passed a problem got TLE or were hacked because of this. An example showing the bug (by kostia244):
#include <bits/stdc++.h>
using namespace std;
int main() {
vector<vector<int>> TLE(40000, vector<int>(7));
string s;
for (int i = 0; i < 17; i++) s += to_string(i) + " ";
for (int j = 0; j < 10000; ++j) {
istringstream ss(s);
int x;
while (ss >> x);
}
}
A loop of about 10000 * 17 iterations takes 1.5s to complete. The weird and funny thing about the bug is that if 40000 or 7 is changed slightly, or if the vector is moved to the middle, then nothing happens and the code runs normally. This peculiar behavior indicates something is buggy and TLE is not caused by the code's authors. It only happens on Codeforces, and only with 64-bit C++. No one was able to reproduce it anywhere else or explain why.
After pajenegod also posted a blog on the situation, people in the comments (drdilyor, bkbtpout, et al.) quickly discovered that it's only the (de-)allocations that matter. Here's my attempt at reducing the code to only new
and delete
:
#include <bits/stdc++.h>
using namespace std;
void *alloc(size_t s) {
auto c = new char[s];
asm volatile("" ::"r"(c));
return c;
}
int main() {
auto c = alloc(0x1c);
for (int i = 0; i < 40000; i++) {
alloc(0x1c);
}
delete[] c;
for (int i = 0; i < 100000; i++) {
delete[] alloc(0x1c);
}
}
This consistently takes around 900ms in both C++17 (64) and C++20 (64).
An overview of what I investigated in the past few hours
The vector<int>(7)
in one case immediately caught my attention, because it makes an allocation of 28 bytes. 28 is special because it's $$$32 - 4$$$, where $$$32$$$ is a common chunk size of various memory allocation algorithms, and $$$4$$$ is the size of a pointer used to store chunks (if you didn't know, on Codeforces malloc
uses 32-bit libraries and returns 32-bit pointers, even with 64-bit C++. you are scammed). This combined with the fact that changing the 7 makes the bug disappear, hints that something is wrong with allocations.
By default new
and delete
with GCC simply call the malloc
and free
functions from the C standard library. I tried adding some time measurements to see if malloc
is really at fault. And surely enough, it is.
Knowing that it's not caused by the C++ STL themselves, but the allocation functions called when creating them, debugging became a little easier.
With my superficial knowledge of heaps, I can quickly guess what is going on. The 40000 malloc
s eats the top chunk to a specific size, then the first free
sends a chunk into the tcache, or the 0x20 fastbin if that's not a thing. Then the next sequence of malloc
/free
s keep taking and returning the same chunk, and this combined with the 40000 mallocs
somehow breaks the allocation algorithm, making it take a lot more time, is what happened, right? Oh, wait. uh. no...
...wait. WAIT. I forgor 💀
The above is what might happen in GNU's C library. Codeforces runs on Windows, using Microsoft's C library. What does that do?
Codeforces's C++20 (64) uses WinLibs, which uses either MSVCRT (Microsoft Visual C++ Runtime) or UCRT (Universal C Runtime) as the C standard library. We don't know what exact setup Codeforces has, but we can first investigate what each library does. Fortunately, the source code of both are available.
Let's look at UCRT first. In non-debug mode, malloc
simply calls _malloc_base
, which is just a wrapper to the Windows API HeapAlloc
. Similarly free
calls HeapFree
. These memory management functions are provided by the operating system, could they be the source of bugs? We can test this by replacing malloc
and free
calls by the API calls directly.
This runs normally, which means that UCRT is unlikely the reason of the bug.
Then we look at MSVCRT's source code. It also uses Windows API functions, except when it doesn't. Depending on the system version, there are 2 more implementation of memory allocation algorithms, V6_HeapAlloc
and V5_HeapAlloc
, which seem to allocate memory by looking through a linked list for free chunks. I did not have enough time or energy to read the implementation, but it is possible that the function has bad performance in certain cases.
Then I found this interesting article, which shows that MSVCRT's malloc
might really be some terrible implementation, and Codeforces users are not its first victims.
If we can know for sure which C library C++20 (winlibs) and C++17 (msys2) use, we can basically pin the possibility down to one bad function in a library. However, it's difficult to get the exact setup on Codeforces even with custom test. If only there was a way to download compiled binaries from Codeforces... wait, there is. On Polygon, full problem packages include compiled versions of checkers, validators, etc. I can simply add a generator to a random problem and download the package to obtain an executable file (although only C++17). Fortunately, it has some debug info, and is statically linked.
After briefly inspecting the 3MB, impossible-to-reverse-engineer PE32+ executable, I can confirm that it indeed uses MSVCRT's malloc
. Now we can basically be sure that the cause of the weirdest slowdown bugs in Codeforces history is the poor implementation of the C runtime.
A possible fix?
Now we know the exact issue that caused the mysterious slowdowns: Microsoft's astronomically disastrous dumpster fire C library. How to fix this? It's easy, just replace it with memory management libraries that are known to work well, for example jemalloc, mimalloc (ironically made by MS themselves to replace their library functions they themselves know are terrible), or ptmalloc. One could also replace the entire library with, say, UCRT or GNU libc.
But as regular Codeforces users we can't control what libraries we use; only Codeforces is able to change that. However, we can rewrite our code to not use the C library to allocate at all. It's possible to do it by calling raw Windows API functions, as shown above. The following is copy-pasted from the TL;DR section:
#include <windows.h>
void *operator new(size_t size) {
if (void *ptr = HeapAlloc(GetProcessHeap(), 0, size ? size : 1)) return ptr;
throw std::bad_alloc{};
}
void operator delete(void *ptr) {
HeapFree(GetProcessHeap(), 0, ptr);
}
Does this fix the slowdown bug? No one will likely ever know, because unlike Microsoft's C libraries or STL, the lack of source code (and the sheer complexity) makes the Windows API impossible to understand. But at least we now know the bug is probably caused by library functions and can pray that skipping the library solves it.
In a way you could say that the bug is not Codeforces's fault... but it's not really not their fault either, as there is a very simple solution (that would also solve countless other issues reported by users before): run GNU/Linux, like literally every other online judge of considerable scale.
Update: About hacks
After the bug was discovered, people are hacking submissions by using special inputs that cause a certain sequence of allocations to happen. Since this is a problem of the library and not of contestants' codes, I don't think it's fair doing that.
A similar case is hacking bad std::unordered_map
, compared to, e.g., hacking bad rolling hashes. The latter is a problem with contestant implementations, and I enjoy hacking them. But the former is just bad implementation from C++ STL, and isn't really the contestants' fault, so I don't usually hack them. This problem is more well-known though, and has many simple fixes available, so it's probably OK to hack them.
Before this slowdown bug and its fixes are well-known, I think hacking using the bug is equivalent to exploiting n-day vulnerabilities on Codeforces to manipulate verdicts, and I'm not very supportive of that. Let's hope a patch is deployed soon.
Afterwords
I'm not touching Windows or PE for the next 6 months. Leave a comment if you play CTFs and understand my feeling.
Subscribe for more posts like this where i solve CP bugaboos with the most absurd techniques
How so orz sir how can I be as good at low level stuffs as you
play ctf and read solutions.
ctf ?
I think he meant Capture The Flag
Should we use C++17 until it is fixed?
Yes, in fact currently Codeforces disabled GNU C++20(64) language option.
Not only C++20(64) but also other compilers using MSCVRT, along with MSC++2017.
Also I once heard that the
gp_hash_table
in pbds contains bugs. Not sure if it's true or not, I decided not to use it any more.What alternative do you use for it (asking because I also use it sometimes to squeeze in unintended sols, mostly failing, but sometimes it is useful)?
cc_hash_table
, of course.The only "bug" I know of is that doing
swap(a,b)
on twogp_hash_table
s isO(n)
, while it is O(1) usingunordered_map
. As seen in this comment.gp_hash_table
is also slightly easier to anti-hash hack thanunordered_map
sincegp_hash_table
mods the hash by some power of 2. So if someone is usinggp_hash_table
without a custom hash, then they can be hacked by simply being given multiples of some sufficiently large power of two..
I just checked. Remind, that I have MinGW version 11.2.0 from here and I didn't encounter slow execution. A comment in output asm file says, that it uses
UCRT
.But I have interesting question not related to this topic.
In asm code the malloc is replaced with mysterious
_Z5allocy
(with functions_Znay
and_ZdlPv
), which is declared through.globl
. I coundn't find any information about_Z5allocy
neither in Google, nor in MinGW repo, nor in UCRT repo, nor in my local mingw64 directory. How it can be possible?_Z5allocy
is justalloc(unsigned long long)
(a.k.aalloc(size_t)
).These are mangled function name. To demangle these kind of functions, you can try execute
c++filt _Z5allocy
on your computer (c++filt
is part of the binutils project)Thanks.
I continued a little and used
objdump
on compiled to .exe program. There is_Znay
function, which simply calls__imp__Znay
function.There is also
_imp_Znay
label, but it doesn't look like a function: all__imp__*
labels have size 8 bytes with second byte equal toa5
and bytes 3-8 equal to00
. I guess, this is where the pointer to dynamically linked function is substituted, and first byte is something like id.I found a simple way to determine whether
HeapAlloc
/HeapFree
are used. If they are, then sincemalloc
andHeapAlloc
are basically equivalent, the following code would run normally:Using Codeforces custom test:
In C++17 (64) and C++20 (64), it returns with a STATUS_HEAP_CORRUPTION error, indicating that
malloc
's result did not come fromHeapAlloc
directly and was possibly handled in another library function.In C11 (32) and C++17 (32), it returns with a STATUS_ACCESS_VIOLATION error (sometimes a larger program is needed to trigger), indicating something similar. A different behavior can mean that they use a different implementation from 64-bit C++. Since they don't seem to be affected by the slowdown bug, this might be the case.
It works normally in Visual C++ 2017 (larger programs with only
delete
replaced also works), which means that its implementation callsHeapAlloc
internally.Great blog mister, things make a lot more sense now.
Yet another L for closed source, it is crazy how bad they can fumble such a low level API call and how that can go unnoticed. Hope mike actually does something this time now that we have much better evidence that there is a problem.
How can I attain proficiency in fundamental concepts like you ?
Kindly elucidate, esteemed sir, upon the strategies and techniques through which I may aspire to achieve a level of competence akin to yours in the realm of tasks that are of a more rudimentary nature.
but how?
One more reason to write your programs directly in binary code.
Is it prudent to stick with C++17 until the issue is resolved?
I am so lucky that my submission was hacked after the system test,and I changed to C++17 to accept.
Then I became International Grandmaster!
Sorry, It was a stupid hack but I had to try it out :)
Very impressive, thanks!
Switching the whole server to Linux is not as easy as upgrading it to later Windows versions. If CF would install an update to obtain UCRT or just simply switch to Windows10/11 or Winserver2016/2019/2022, it could prevent those bugs from happening by using UCRT-based mingw64 version of GCC, Clang and MSVC compiler instead of using MSVCRT version.