Discrepancy in performance depending on compiler

№	Пользователь	Рейтинг
1	tourist	4009
2	jiangly	3823
3	Benq	3738
4	Radewoosh	3633
5	jqdai0815	3620
6	orzdevinwang	3529
7	ecnerwala	3446
8	Um_nik	3396
9	ksun48	3390
10	gamegame	3386

№	Пользователь	Вклад
1	cry	165
2	maomao90	163
2	Um_nik	163
4	atcoder_official	161
5	adamant	160
6	-is-this-fft-	158
7	awoo	157
8	TheScrasse	154
9	nor	153
9	Dominater069	153

While trying to squeeze my solution of 1336E2 - Chiori and Doll Picking (hard version) into the time limit, I have encountered an unexpectedly large difference in the execution times among the various C++ compilers offered by Codeforces. Since I think it is something worth knowing, I am sharing this discovery.

Consider the following minimal working example (it generates $$$2N=30$$$ random $$$56$$$-bits numbers and computes the sum of the bits of the xors of all the subsets of the $$$30$$$ numbers).

Code

#pragma GCC optimize ("unroll-loops")
#pragma GCC target("sse3,sse4")

#include <assert.h>
#include <stdlib.h>

typedef unsigned long long ULL;

ULL rand_short() { // 8 bits
    return rand() & ((1<<8)-1);
}

ULL rand_ull() { // 56 bits
    ULL res = 0;
    for (int i = 0; i < 7; i++) res = (res<<8)|rand_short();
    return res;
}

int main() {
    srand(123);
    const int N = 15;
    
    ULL a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = rand_ull(), b[i] = rand_ull();

    ULL c[1<<N], d[1<<N];
    for (int bb = 0; bb < (1<<N); bb++) {
        c[bb] = d[bb] = 0;
        for (int i = 0; i < N; i++) {
            if (bb&(1<<i)) c[bb] ^= a[i], d[bb] ^= b[i];
        }
    }
    
    ULL res = 0;
    for (int i = 0; i < (1<<N); i++) for (int j = 0; j < (1<<N); j++) {
        res += __builtin_popcountll(c[i]^d[j]);
    }
    assert(res == 30064771072);
}

The important lines are the following ones, where the functions __builtin_popcountll and xor are called $$$2^{30}$$$ times.

 ULL res = 0;
 for (int i = 0; i < (1<<N); i++) for (int j = 0; j < (1<<N); j++) {
     res += __builtin_popcountll(c[i]^d[j]);
 }

Executing the above program in Codeforces custom invocation yields these execution times:

 Compiler                         Execution Time
 GNU GCC C11 5.1.0                4040 ms
 GNU G++11 5.1.0                  4102 ms
 GNU G++14 6.4.0                  1123 ms
 GNU G++17 7.3.0                  1107 ms
 GNU G++17 9.2.0 (64bit, msys 2)  374 ms

Notice that the 64bit-native compiler produces a much faster executable (and notice also that among the other compilers there is quite a difference). Hence, next time you have to optimize a solution with a lot of bit-operations on 64bit integers (and, in fact, this situation is not so uncommon), consider using the compiler GNU G++17 9.2.0 (64bit, msys 2).

It might be that the differences among the execution times are due to the way I have written the program (maybe the wrong PRAGMAS? Maybe preventing some compilers from optimizing because of a certain idiom? Maybe something else?), if this is the case, please enlighten me!

Блог пользователя dario2994