ToxicPie9's blog

By ToxicPie9, 5 weeks ago, In English

This is my answer to https://codeforces.net/blog/entry/137484, in which -is-this-fft- asked a bizarre question. This answer is too big for a comment so I made a blog post instead.

The question

The submission 297353626 behaves normally with 180 MiB memory usage while 297352244 got a runtime error due to stack overflow (with 256 MiB stack). Both submissions have identical code, don't seem to contain bugs, and an assertion exists to make sure the recursion is at most $$$2\times10^5$$$ layers deep. What's happening?

TL;DR

Notice that the solution with stack overflow uses GCC 7 (32-bit) while the accepted ones use GCC 13 (64-bit). This is not a coincidence.

The main reason of unreasonable stack usage boils down to two points: GCC 7, and 32-bit.

32-bit

First of all, the code is compiled and run in 32-bit mode. Compiling the submission with GCC 7.5 in 32-bit on godbolt shows that the long long solve(long long, long long) function has a gargantuan stack frame, seen from its prelude:

_Z5solvexx:
    push    ebp
    push    edi
    push    esi
    push    ebx
    sub     esp, 1948

This means that each call to solve uses $$$8\times2+4+4\times4+1948=1984$$$ bytes of stack space (2 8-byte arguments + 4-byte return address + 4 4-byte registers + 1948 bytes for stack variables). In a $$$2\times10^5$$$-layer recursion, the stack usage would be about 378 MiB, which probably overflows the stack. However, solve looks like a very simple function. Why does it need that much memory?

In 32-bit x86, there are less registers to use than in 64-bit mode, and each register only stores 32 bits. In the submission, every integer variable is a 64-bit long long, including loop indices, answer values, etc. Since 64-bit arithmetic cannot be done directly, all those variables have to be stored somewhere on the stack, and when they are used, only half of a variable (32 bits) is pulled out and processed each time. If 64-bit intermediate values are required in the middle of a computation, they must be stored as well.

If we count all the variables, including temporary values used in statements, it can be seen that they use a good amount of space (at least a couple hundred bytes) but nowhere close to 1948 bytes. What else is happening?

cdecl my beloved

Another disastrous part comes from function calling. Let's take the function long long solve(long long, long long) as an example.

In 64-bit x86, the main calling conventions used in C are Microsoft x64 and System V. Since there are many registers, the first few (small) arguments and the return value can be put in registers for faster access. In both conventions, the 2 64-bit arguments are placed in 2 registers, and the 64-bit return value is stored in a register.

In 32-bit, the main calling convention in C is cdecl. The 16 bytes of function arguments are pushed onto the stack in reverse order, as well as an extra pointer. This extra pointer points to 8 bytes of stack-allocated space and is used to store the return value. This causes every solve call itself to use 24 more bytes of stack space.

The problem gets worse when functions get a bit bigger, for example:

  • pair<long long, long long> op_min(pair<long long, long long> p, pair<long long, long long> q): in 64-bit, a pair is stored in 2 registers for both arguments and the return value, so no extra stack is used; in 32-bit, they are all on the stack, taking up 48 bytes.
  • pair<long long, long long> Segtree::query(Segtree *this, ll ql, ll qr, ll u): left as exercise for the reader.

GCC

You might think compiler optimizations help by optimizing register usage, reusing stack space, etc., but in this case, GCC actually made the problem a lot worse.

Looking at the compiled code closely, you can see that the solve function has too much calls to Segtree::query. 28, to be exact. This is because of function inlining — to reduce the overhead of function calls, compilers sometimes replace a function call with the code of the function itself. This can be done recursively, for example if the query calls in the query function were to be replaced by itself twice, it results in a large function with 8 query calls.

With inlining, the short solve function gets expanded into a lot of code. Each function call and its surrounding code requires stack space to operate, and it seems that the older GCC 7 is less efficient at utilizing stack space for different parts. For example, it can be seen that the results of different query or op_min calls are all stored in variables in different places, but their values are not used together. While the optimization seems to work well for speed, it multiplies the stack usage problems from previous sections.

GCC 14 does the optimizing job a lot better. In 32-bit mode, a lot of code is inlined too, but stack variables are reused more efficiently, and it uses only 380 bytes instead of 1948. In 64-bit mode, even more stack space is saved, requiring only 88 bytes.

Conclusion

It's almost 2025, consider using something newer than GCC 7 (32-bit) on Codeforces. Also to the clown developers/end users who keep shipping/running 32-bit executables on Windows for "compatibility issues" or other reasons: please stop.

Full text and comments »

  • Vote: I like it
  • +337
  • Vote: I do not like it

By ToxicPie9, 11 months ago, In English

TL;DR

Currently people are discussing a slowdown bug on Codeforces that seems to happen randomly, and can cause code to run 100x slower and get TLE. More details in pajenegod's blog post.

In this article, I present a mitigation: add the following to your code.

#include <windows.h>
void *operator new(size_t size) {
    if (void *ptr = HeapAlloc(GetProcessHeap(), 0, size ? size : 1)) return ptr;
    throw std::bad_alloc{};
}
void operator delete(void *ptr) {
    HeapFree(GetProcessHeap(), 0, ptr);
}
If you use malloc/free in C++ (you shouldn't), also change them.
If you use aligned new/delete in C++, also change them.
DISCLAIMER

TL;DR, part 2

This section is for Codeforces admins including MikeMirzayanov.

I have very high confidence that some poorly implemented functions in Microsoft's C runtime is the root cause of the bug which is outside contestants' control, creating unfair disadvantages in contests. The bug has likely existed since C++17 (64-bit) first came out on Codeforces.

Every C++ submission on Codeforces (and also the entire Codeforces infrastructure) currently depends on that library, on top of the internals of Windows that very few people has any clue about. Both of these are infamous for having bad implementations in many places, and I don't think it's a good idea to rely on those for one of the online judges with the most users.

I ask Codeforces developers to seriously consider other, more reliable possibilities, for example an option for judgment on Linux machines, possibly using one of the plenty judge server implementations already available. An increased stability and robustness would hopefully make contest and judging experience on Codeforces a lot better.

What's happening?

There has been a bug on Codeforces causing random TLEs recently. Sometimes when a seemingly unrelated line of code is added, or the order of two lines are swapped, or a vector size is changed, etc., the code suddenly becomes 100x slower on certain inputs. The bug got some attention because of its unexplained behavior, and the fact that many submissions (include tourist's) that should have passed a problem got TLE or were hacked because of this. An example showing the bug (by kostia244):

#include <bits/stdc++.h>

using namespace std;

int main() {
    vector<vector<int>> TLE(40000, vector<int>(7));

    string s;
    for (int i = 0; i < 17; i++) s += to_string(i) + " ";

    for (int j = 0; j < 10000; ++j) {
        istringstream ss(s);
        int x;
        while (ss >> x);
    }
}

A loop of about 10000 * 17 iterations takes 1.5s to complete. The weird and funny thing about the bug is that if 40000 or 7 is changed slightly, or if the vector is moved to the middle, then nothing happens and the code runs normally. This peculiar behavior indicates something is buggy and TLE is not caused by the code's authors. It only happens on Codeforces, and only with 64-bit C++. No one was able to reproduce it anywhere else or explain why.

After pajenegod also posted a blog on the situation, people in the comments (drdilyor, bkbtpout, et al.) quickly discovered that it's only the (de-)allocations that matter. Here's my attempt at reducing the code to only new and delete:

#include <bits/stdc++.h>
using namespace std;

void *alloc(size_t s) {
    auto c = new char[s];
    asm volatile("" ::"r"(c));
    return c;
}

int main() {
    auto c = alloc(0x1c);
    for (int i = 0; i < 40000; i++) {
        alloc(0x1c);
    }
    delete[] c;
    for (int i = 0; i < 100000; i++) {
        delete[] alloc(0x1c);
    }
}

This consistently takes around 900ms in both C++17 (64) and C++20 (64).

An overview of what I investigated in the past few hours

The vector<int>(7) in one case immediately caught my attention, because it makes an allocation of 28 bytes. 28 is special because it's $$$32 - 4$$$, where $$$32$$$ is a common chunk size of various memory allocation algorithms, and $$$4$$$ is the size of a pointer used to store chunks (if you didn't know, on Codeforces malloc uses 32-bit libraries and returns 32-bit pointers, even with 64-bit C++. you are scammed). This combined with the fact that changing the 7 makes the bug disappear, hints that something is wrong with allocations.

By default new and delete with GCC simply call the malloc and free functions from the C standard library. I tried adding some time measurements to see if malloc is really at fault. And surely enough, it is.

Code
Result in custom test

Knowing that it's not caused by the C++ STL themselves, but the allocation functions called when creating them, debugging became a little easier.

With my superficial knowledge of heaps, I can quickly guess what is going on. The 40000 mallocs eats the top chunk to a specific size, then the first free sends a chunk into the tcache, or the 0x20 fastbin if that's not a thing. Then the next sequence of malloc/frees keep taking and returning the same chunk, and this combined with the 40000 mallocs somehow breaks the allocation algorithm, making it take a lot more time, is what happened, right? Oh, wait. uh. no...

...wait. WAIT. I forgor 💀

The above is what might happen in GNU's C library. Codeforces runs on Windows, using Microsoft's C library. What does that do?

Codeforces's C++20 (64) uses WinLibs, which uses either MSVCRT (Microsoft Visual C++ Runtime) or UCRT (Universal C Runtime) as the C standard library. We don't know what exact setup Codeforces has, but we can first investigate what each library does. Fortunately, the source code of both are available.

Let's look at UCRT first. In non-debug mode, malloc simply calls _malloc_base, which is just a wrapper to the Windows API HeapAlloc. Similarly free calls HeapFree. These memory management functions are provided by the operating system, could they be the source of bugs? We can test this by replacing malloc and free calls by the API calls directly.

Test code

This runs normally, which means that UCRT is unlikely the reason of the bug.

Then we look at MSVCRT's source code. It also uses Windows API functions, except when it doesn't. Depending on the system version, there are 2 more implementation of memory allocation algorithms, V6_HeapAlloc and V5_HeapAlloc, which seem to allocate memory by looking through a linked list for free chunks. I did not have enough time or energy to read the implementation, but it is possible that the function has bad performance in certain cases.

Then I found this interesting article, which shows that MSVCRT's malloc might really be some terrible implementation, and Codeforces users are not its first victims.

If we can know for sure which C library C++20 (winlibs) and C++17 (msys2) use, we can basically pin the possibility down to one bad function in a library. However, it's difficult to get the exact setup on Codeforces even with custom test. If only there was a way to download compiled binaries from Codeforces... wait, there is. On Polygon, full problem packages include compiled versions of checkers, validators, etc. I can simply add a generator to a random problem and download the package to obtain an executable file (although only C++17). Fortunately, it has some debug info, and is statically linked.

After briefly inspecting the 3MB, impossible-to-reverse-engineer PE32+ executable, I can confirm that it indeed uses MSVCRT's malloc. Now we can basically be sure that the cause of the weirdest slowdown bugs in Codeforces history is the poor implementation of the C runtime.

A possible fix?

Now we know the exact issue that caused the mysterious slowdowns: Microsoft's astronomically disastrous dumpster fire C library. How to fix this? It's easy, just replace it with memory management libraries that are known to work well, for example jemalloc, mimalloc (ironically made by MS themselves to replace their library functions they themselves know are terrible), or ptmalloc. One could also replace the entire library with, say, UCRT or GNU libc.

But as regular Codeforces users we can't control what libraries we use; only Codeforces is able to change that. However, we can rewrite our code to not use the C library to allocate at all. It's possible to do it by calling raw Windows API functions, as shown above. The following is copy-pasted from the TL;DR section:

#include <windows.h>
void *operator new(size_t size) {
    if (void *ptr = HeapAlloc(GetProcessHeap(), 0, size ? size : 1)) return ptr;
    throw std::bad_alloc{};
}
void operator delete(void *ptr) {
    HeapFree(GetProcessHeap(), 0, ptr);
}
If you use malloc/free in C++ (you shouldn't), also change them.
If you use aligned new/delete in C++, also change them.
DISCLAIMER

Does this fix the slowdown bug? No one will likely ever know, because unlike Microsoft's C libraries or STL, the lack of source code (and the sheer complexity) makes the Windows API impossible to understand. But at least we now know the bug is probably caused by library functions and can pray that skipping the library solves it.

In a way you could say that the bug is not Codeforces's fault... but it's not really not their fault either, as there is a very simple solution (that would also solve countless other issues reported by users before): run GNU/Linux, like literally every other online judge of considerable scale.

Update: About hacks

After the bug was discovered, people are hacking submissions by using special inputs that cause a certain sequence of allocations to happen. Since this is a problem of the library and not of contestants' codes, I don't think it's fair doing that.

A similar case is hacking bad std::unordered_map, compared to, e.g., hacking bad rolling hashes. The latter is a problem with contestant implementations, and I enjoy hacking them. But the former is just bad implementation from C++ STL, and isn't really the contestants' fault, so I don't usually hack them. This problem is more well-known though, and has many simple fixes available, so it's probably OK to hack them.

Before this slowdown bug and its fixes are well-known, I think hacking using the bug is equivalent to exploiting n-day vulnerabilities on Codeforces to manipulate verdicts, and I'm not very supportive of that. Let's hope a patch is deployed soon.

Afterwords

I'm not touching Windows or PE for the next 6 months. Leave a comment if you play CTFs and understand my feeling.

Subscribe for more posts like this where i solve CP bugaboos with the most absurd techniques

Full text and comments »

  • Vote: I like it
  • +794
  • Vote: I do not like it

By ToxicPie9, 22 months ago, In English

Having a special joke contest on April 1st each year has become a tradition on Codeforces. Usually, April Fools contests are prepared by Nickolas. However, last year there wasn't an official one, and a contest was made by Agnimandur and magnus.hegdahl instead.

Since April 1st is coming soon, I want to ask: Is there a plan for Codeforces to hold April Fools Day Contest 2023?

If an official April Fools contest is not planned this year, a group of friends and I (AlperenT, Ari, BucketPotato, flamestorm, ScarletS, ToxicPie9) already have a lot of problem ideas and are ready to prepare a round (we will need help from MikeMirzayanov to make it happen).

Full text and comments »

  • Vote: I like it
  • +218
  • Vote: I do not like it

By ToxicPie9, 2 years ago, In English

Yesterday YouKn0wWho posted a blog explaining common mistakes in competitive programming and how to avoid them. I was greatly inspired by that post so I decided to make my own.

I have compiled some of the mistakes that I didn't make in my early Competitive Programming phase. I also mentioned how to avoid them. Also, in most cases, I will give you a chance to find out what the bug is before I reveal the culprit as I tried to make this blog interactive. The codes that I have used in this blog have been written in Rust as it is the most beloved language for CP.

Mistake 1

Check out the following code:

Code

The output should be $$$10^{18}$$$. But if you run the code, you will get a different output. Why?

Reason

Mistake 2

Check out the following code:

Code

Try to run this locally. What is the time complexity of this?

Is it $$$O(n)$$$?

Reason

Mistake 3

Check out the following code:

Code

Notice that it is guaranteed that total sum of n is <= 100000. So how many operations will the code take in the worst case?

Reason

Mistake 4

What is happening in the following code?

Code

The output is supposed to be [1, 1, 1, 1, 1]. But it's not the case actually! Why?

Reason

Mistake 5

Don't use endl! If your code needs to print millions of newlines, then using endl turns out to be really slower than using '\n'. Why?

Reason

Mistake 6

Use pow() function for integer calculations.

Why?

Mistake 7

Run the following code

Code

You might expect the output to be $$$-1$$$. But the output is actually not! Why?

Reason

Mistake 8

Using eprintln! might be a good way to debug your code as it doesn't output to the standard output. But leaving the eprintln! instances in your code while submitting in OJ might be one of the worst ways of getting TLE.

Smash me for more info.

Mistake 9

Look at the following code

Code

The output is $$$0$$$, which is correct. Why?

Reason

Mistake 11

Consider the following code for calculating the maximum occurrence in an array.

Code

This code seems like it can be hacked easily but in fact, for almost every valid input, it should get AC.

Why?

Mistake 12

Run the following code:

Code

What will be the size of the map? $$$5$$$?

Check

Mistake 13

I forgot Mistake 10.

Mistake 14

Check this out.

Code

What is the time complexity of this?

Check

More Mistakes and Non-Mistakes

  • Do not insert or erase from a container (Vec, HashSet etc) while traversing it using for x in s.iter() like syntax at the same time. This is because the compiler won't let you do that. In Rust, you are not allowed to borrow a variable while it is also borrowed as mutable at the same time (e.g. when inserting).
  • Create variables only when you need them instead of just declaring let a, b, c, d, e, f, g, h; and 69 other variables at the beginning of your code! This is because Rust's grammar doesn't let you do that.
  • If you want to count the number of set bits in an i64 number, then use the count_ones() method instead of the __builtin_popcount function. This is because there is no __builtin_popcount function.
  • If you want to compute the square root of an f64 number, then use the sqrt() method instead of sqrt() because sqrt() function takes self as input whereas sqrt() takes self.
  • Speaking of Runtime Error, the most likely case of getting Runtime Error is when your code encounters an error while running.
  • let x: i64 = 1 << 40; will not overflow as Rust will try to deduce the types of integer literals if they're not specified. In this case, the compiler determines that 1 is i64.
  • Don't accidently write your code in C++.

Thanks for reading the blog. Feel free to add more mistakes that are common in CP in the comment section.

See you on a later episode, friend blobheart!

P.S. Although this is mostly a joke post, I tried to be accurate about the facts and did not intentionally put incorrect information in it (which means you may treat it as a somewhat educational blog). It also aims to showcase some features of Rust that helps you avoid some common CP mistakes in languages like C++.

Full text and comments »

  • Vote: I like it
  • +337
  • Vote: I do not like it

By ToxicPie9, history, 2 years ago, In English

Hello Codeforces!

I have created a Discord server for the upcoming ICPC World Finals in Dhaka, so the finalists can have a place to chat with each other.

You can join it with the link: https://discord.gg/p3W9YCZ3AK

Full text and comments »

  • Vote: I like it
  • +54
  • Vote: I do not like it

By ToxicPie9, history, 3 years ago, In English

sus has just overtaken Errichto and became the second top contributor on Codeforces. During the last few days, his contribution points skyrocketed at a terrifying rate.

Will sus eventually top Monogon and become the new ruler of Codeforces? Or is the power of Monogon's comments simply unmatched by mere mortals?

meme

Full text and comments »

  • Vote: I like it
  • +14
  • Vote: I do not like it