[Tutorial] GCC Optimization Pragmas

→ Pay attention

Before contest
Codeforces Round 995 (Div. 3)
38:34:40
Register now »

→ Streams

Codeforces Round 995 Solution Discussion

By aryanc403

Before stream 40:54:40

View all →

→ Top rated

#	User	Rating
1	tourist	3985
2	jiangly	3814
3	jqdai0815	3682
4	Benq	3529
5	orzdevinwang	3526
6	ksun48	3517
7	Radewoosh	3410
8	hos.lyric	3399
9	ecnerwala	3392
9	Um_nik	3392

Countries | Cities | Organizations

View all →

→ Top contributors

#	User	Contrib.
1	cry	169
2	maomao90	162
2	Um_nik	162
4	atcoder_official	161
5	djm03178	158
6	-is-this-fft-	157
7	adamant	155
8	awoo	154
8	Dominater069	154
10	luogu_official	150

View all →

→ Find user

→ Recent actions

Detailed →

nor's blog

[Tutorial] GCC Optimization Pragmas

By nor, 3 years ago, In English

https://nor-blog.codeberg.page/posts/2021-10-26-gcc-optimization-pragmas/

optimization bad, optimize good

+854

nor
3 years ago
40

Comments (40)

Write comment?

ToxicPie9

3 years ago, # |

+332

as a co-author, i helped making the meme and the post.

please upvote if you find the post helpful. we might post more about technical stuff in the future.

→ Reply

hbarp

3 years ago, # |

← Rev. 2 →

+45

Finally a blog that explains behind the scenes for pragma spells black magic. Thanks ! :pkinglove:

→ Reply

sslotin

3 years ago, # |

+57

Very nice post. Most people in the competitive programming community are using pragmas very imprudently like if it was some magic, so demystifying them is extra important.

One thing to note is that most SIMD extensions are backwards compatible, so you only need to include the latest one instead of the sse,sse2,sse3... mess that people often use (except for the situational ones like popcnt). So just leaving the following two cover your needs in 95% of the cases:

#pragma GCC optimize("O3")
#pragma GCC target("avx2")

Also, it may or may not be beneficial to ask your compiler to tune your code for a specific architecture. GCC isn't very good at that, but it almost certainly won't make things worse:

#pragma GCC target("avx2,tune=native")

All in all, I guess for competitive programming this setup covers almost all and has a very low chance of backfiring:

#pragma GCC optimize("Ofast,unroll-loops")
#pragma GCC target("avx2,tune=native")

Also, I wrote a broader article on the topic, and it has some other compiler optimizations if someone is interested.

→ Reply

nor

3 years ago, # ^ |

← Rev. 2 →

+25

Thanks for the comments, the backwards compatibility in most people's code is quite a mess (hence it warranted its own place in the meme :)).

I tend to avoid tune=native since it sometimes breaks some stuff locally/on the judge (for instance, compilation error here: https://godbolt.org/z/53c877nbh ). I haven't tried that yet on CF though. Also I tend to not use Ofast since I've heard about people's code breaking on geometry problems, and a couple of times it led to much slower code than O3 (though it's quite rare, so it should be fine anyway).

Your book that you linked is pretty nice by the way, hoping to see more stuff like that!

→ Reply

Kyou_mo_kawaii

3 years ago, # ^ |

← Rev. 2 →

+24

You explained why sse, mmx etc are redundant because of avx2. But why omit the bit manipulation stuff like popcnt, lzcnt, abm, bmi, bmi2? Can you give an example of when these can "backfire"? Also it seems like fma is not always covered by avx2 either right?

I am tempted to use this as my default now:

// change to O3 to disable fast-math for geometry problems
#pragma GCC optimize("Ofast,unroll-loops")
#pragma GCC target("avx2,popcnt,lzcnt,abm,bmi,bmi2,fma,tune=native")

(and @nor, it would super useful if you had a correct copy-and-pastable snippet in your tldr. If I googled your blog post during the contest I would probably end up copying the first block of code I saw, which is the one that does nothing lol)

EDIT: sslotin I just started looking at your book and it's fucking amazing! I highly recommend anyone who was interested enough to click on this blog post to read it too.

→ Reply

nor

3 years ago, # ^ |

+21

I think tune=native should be used cautiously; you should check if it compiles.

Also regarding the copy-and-pastable snippet, I mentioned in the blog that it depends quite a bit on the specific judge machines, so it's not exactly copy-and-pastable. And if anything, it should already be in your template so you don't need to copy paste from this blog post :)

However I added the pragmas that I tend to use on Codeforces just for easy reference. Hopefully that helps.

→ Reply

sslotin

3 years ago, # ^ |

← Rev. 2 →

By backfire I mean to cause the program to either crash or slow down significantly, or in general to behave differently on your laptop and on the testing system (without necessarily causing runtime errors). The more hardware-specific options and optimizations are enabled, the higher the chances of that happening.

To be safe you need to find out the exact microarchitecture the server is running, look up the list of its extensions and intersect it with what you have locally. I did this for CodeForces a while back when I was curious, and found out it was running a Skylake or something very similar, which has these extensions listed in the docs:

"skylake": Intel Skylake CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, BMI2, F16C, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVEC and XSAVES instruction set support.

There are a lot of them, and I don't really know what a third of them does or which are included in which. The reason I recommend only leaving avx2 is because it is the only thing that is almost guaranteed to also be enabled on your workstation (unless you are an Apple M1 user) and also probably the only extension that really matters — I doubt anyone has ever been bottlenecked by the throughput of popcount, to be honest.

Actually, I think for CodeForces in particular this is what you should to do when submitting:

#pragma GCC target("arch=skylake")

That's it, nothing else except optimization flags. This not only enables all available extensions and only them, but also tells the compiler additional info about the exact microarchitecture such as instruction latencies (this option implies tune=skylake). But be aware that if you leave this pragma uncommented when testing locally, there is a small chance it will crash or behave differently unless you are also running a Skylake.

On some onsite contests like the ICPC WF it is guaranteed that the testing servers will be identical to the workstations, so you can query the platform with g++ -march=native -Q --help=target | grep march and replace skylake with that.

(I have not tested it though. Please someone try it and tell if I am wrong.)

→ Reply

nor

3 years ago, # ^ |

← Rev. 2 →

There's a quick way to check if the targets you're interested in are supported on an unfamiliar online judge: assert(__builtin_cpu_supports("<target-name>")); will throw a runtime error if the CPU doesn't support that target.

For an onsite contest, since you would have access to machines with the same configuration as the judging machines, it's nicer to run g++ -march=native -E -v - </dev/null 2>&1 | grep cc1 and see which targets work (-mno-avx means avx is not supported, -mavx means avx is supported).

Specifying architectures inside pragmas is kind of hopeless on GCC, for instance, #pragma GCC target("arch=skylake") (or anything trying to pass arch/tune parameters to the target) gives a compilation error: https://godbolt.org/z/K5xTeG3Pv

→ Reply

sslotin

3 years ago, # ^ |

← Rev. 2 →

Specifying architectures is kind of hopeless on GCC

Look at the errors: attribute value 'arch=skylake' was already specified in 'target' attribute. Some concurrency-related parts of the GCC C++ library are using exactly this method for performance, but I guess they are conflicting with ours because you can't specify architecture twice or something.

If we put that pragma after the library import this seems to work: https://godbolt.org/z/dWYPjfnWP. Not sure which parts of the standard library will be optimized in this case though.

UPD: it also works without changing anything in older (≤8) GCC versions.

→ Reply

nor

3 years ago, # ^ |

Yeah, as far as I can see, it suffices to keep only iostream before the import to avoid the compilation error. Too bad that bits/stdc++.h doesn't work here.

I tried making sure that some STL functions get optimized, but unfortunately couldn't manage to do that: https://godbolt.org/z/rh59TGbPE (so as of now I can't find a way to use arch=skylake fruitfully).

→ Reply

sslotin

3 years ago, # ^ |

I think this is a bug in GCC 9. When defining hardware-specific library implementations, you are supposed to use #pragma GCC push_options with target, implement a function or a module, and then pop it back (which is a way to do target-inside-target), but in that one file that causes the error it is not done like that.

It is easy to fix, but for already installed compilers there is nothing we can do unfortunately. Still works on the C++17 CF compiler and most other platforms though.

→ Reply

ExplodingFreeze

3 years ago, # ^ |

Polygon claims to judge on the i3-8100, so targeting Coffee Lake might be ideal. Though as far as I know Coffee Lake has no drastic architectural changes from Kaby Lake or Skylake, so it may not lead to a noticeable benefit over targeting Skylake.

→ Reply

sslotin

3 years ago, # ^ |

← Rev. 2 →

I did some investigation, and yes, you are right: it is in fact Coffee Lake. And it shows the same model number running the gist I linked above.

![](https://i.ibb.co/rQn889h/Screenshot-from-2021-10-27-21-40-59.png)

However, as you noted, there aren't many architectural changes from Kaby Lake or Skylake. In fact, in terms of optimization-relevant features, there aren't any at all, so GCC does not distinguish between them and simply groups them all as "skylake".

→ Reply

Savior-of-Cross

3 years ago, # |

+62

Norosity

→ Reply

SPyofgame

3 years ago, # |

+15

As if there is a second upvote button, I would smash it to this blog.

→ Reply

Duy_e

3 years ago, # ^ |

+31

smash yourself instead :>

→ Reply

SPyofgame

3 years ago, # ^ |

No thank you. I already smashed my rating down to pupil few days ago :>

→ Reply

Priyaanshut

3 years ago, # |

+20

norz

→ Reply

KingRayuga

3 years ago, # |

norz

→ Reply

hbarp

3 years ago, # ^ |

+11

Lets not forget ToxicPie9 also co-authored the blog !

→ Reply

KingRayuga

3 years ago, # ^ |

+18

ToxicPie9 orz

→ Reply

hsekharsahoo

3 years ago, # |

+11

norzosity

→ Reply

sslotin

3 years ago, # |

← Rev. 3 →

+36

Let's petition MikeMirzayanov to simply add -march=native (or -march=skylake or whichever architecture invokers are running on) to the compilation flags the next time he adds a new C++ compiler (preferably Clang). Then all the target pragmas can be omitted completely, only leaving optimization options which are situational.

This ensures that the compiler knows everything about the hardware while not increasing compilation time — in fact, it will probably save a lot of CPU resources (on average by 10-20% on all submissions would be my guess). I have no idea why it is not done so on all online judges.

→ Reply

prabhav_

3 years ago, # |

Heartfelt thanks from the bottom of my heart.

→ Reply

codemastercpp

3 years ago, # |

nor orz :prayge:

→ Reply

ssvb

3 years ago, # |

+24

About AVX2. These instructions are also notorious for downclocking the CPU, depending on the number of CPU cores that are using AVX2 simultaneously. The exact effect of this varies depending on the CPU model. See https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/ (it's mainly about AVX-512, but AVX2 also has similar properties to some extent).

So on the Codeforces platform, the AVX2 pragmas may be potentially not very neighbour-friendly among other things. Imagine a quadcore judge machine, which processes 4 submissions in parallel. If 3 of them are actively using AVX2, then it's possible that the CPU may downclock and affect the performance of the unsuspecting 4th non-AVX2 submission too. Again, this may be a non-issue depending on the CPU model, but it would be great if Codeforces could try to do some experiments and benchmarks in this scenario.

The best solution may be just to disable AVX2 instructions on the judge machines.

→ Reply

ExplodingFreeze

3 years ago, # ^ |

← Rev. 3 →

As far as I'm aware AVX2 is nowhere near the power hog that AVX512 is.

See this analysis for some rough numbers in some CPU limited workloads. It is for rocket lake, so its applicability to most judges that are still using the skylake-like architectures isn't 1:1, but it still paints a rough picture of the effect of AVX/AVX2 on power draw.

Also I would highly expect that any decent judge would completely disable core turbos and/or take other measures to mitigate noisy neighbor issues even in the case of full loading of across all cores. And I do believe no judge right now supports AVX-512 for this exact reason.

→ Reply

priyansh_34

3 years ago, # |

← Rev. 5 →

Hi nor and ToxicPie9,

In the problem CF Edu. Round Problem B. I faced an issue, first I wrote the code without using any pragmas and it was accepted. 133543512

Then I tried with some pragmas

#pragma GCC optimize("Ofast,unroll-loops") #pragma GCC target("avx2,popcnt,lzcnt,abm,bmi,bmi2,fma,tune=native")

On using them it displayed wrong in testcase 6.133542940.

Can you or anyone please explain, why this was happened. Any help will be very helpful. Thanks

PS: The error was caused by 2nd pragma, but I don't know why that was happening.

→ Reply

ToxicPie9

3 years ago, # ^ |

← Rev. 3 →

+26

~~I think the issue is unrelated with pragmas;~~ the solution should be incorrect even without them. Update: it is indeed related to pragmas in a weird way, see the comment below

On line 88, you calculated the answer of ceil(rem / k) by dividing two doubles. However, the double type can only store 52 bits of precision. Since both numbers can be close to 1e18 in magnitude, when you convert them to doubles, you might lose some precision and their values might change. (For example, if you convert the value 10000000000000001 ($$$10^{16}+1$$$) into double, it becomes 10000000000000000 ($$$10^{16}$$$).) If one of the values rem or k changes during conversion, you are likely to get a wrong answer.

The more precise way to calculate the answer is to do integer division instead -- for positive integers $$$a, b$$$, the exact result of $$$\lceil\frac{a}b\rceil$$$ can be calculated in C++ with:

a % b == 0 ? a / b : a / b + 1
a / b + (a % b != 0)
1 + (a - 1) / b

~~In fact, I do not understand why one of your solutions got accepted... Maybe the compiler did something suspicious? There's no issue on my end :thonk:~~

→ Reply

priyansh_34

3 years ago, # ^ |

+16

Thanks ! For your nice and to the point explanation <3.

→ Reply

ToxicPie9

3 years ago, # ^ |

+16

Update: Seems like it is caused by an "excess precision" issue that happens on specific compilers on Codeforces. See https://codeforces.net/blog/entry/78161 for more details.

After some testing, only "GNU G++17 (7.3.0)" on Codeforces seem to make the code use extra precision and output the correct answer. In this case pragmas is actually related to the problem: targets like AVX2 support 64-bit floating points, which fixes the excess precision issue and causes the submission to get WA correctly.

→ Reply

Kyou_mo_kawaii

3 years ago, # |

Are there any useful pragmas for controlling stack frame sizes in recursive functions for optimizing MLE? (today's div3's 1607F - Robot on the Board 2 was a good example, and also this past discussion)

It doesn't seem like #pragma GCC optimize("Os") does much. Looking at the list of gcc optimization flags, it does seem like this is something you can control but I don't know how to specify them as pragmas.

From: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html -fconserve-stack Attempt to minimize stack usage. The compiler attempts to use less stack space, even if that makes the program slower. This option implies setting the large-stack-frame parameter to 100 and the large-stack-frame-growth

→ Reply

nor

3 years ago, # ^ |

You can use #pragma GCC optimize("conserve-stack") for that. To check if this pragma works, you can use the compiler flags mentioned in the blog. In general, if you have an option like -fsomething, it makes sense to try #pragma GCC optimize("something") and see if it works.

→ Reply

Kyou_mo_kawaii

3 years ago, # ^ |

Thanks! Unfortunately it doesn't seem like conserve-stack did anything for my problem. I want to try the other flags on that page too but I still can't figure out the pragma syntax for stuff like --param=large-stack-frame-growth=100.

In particular I am trying to get a recursive solution for 1607F to pass. I did get it working for C++17 32 bit, but not for 64 bit (neither C++20 nor C++17): https://codeforces.net/contest/1607/submission/134144039

Do you have any tips?

→ Reply

Honey_Badger

3 years ago, # |

Why didn't you include tzcnt target in your pragmas list?

→ Reply

nor

3 years ago, # ^ |

+24

tzcnt is an instruction that is supported in the bmi instruction set, so it isn't needed to be mentioned separately. Also, the last I checked, it gave me an unrecognized target error.

→ Reply

mukulrawat_03

2 years ago, # |

Can anyone explain which pragma function causes error and how ?

Solution 1-> Pragma on https://codeforces.net/contest/1737/submission/175313327

Solution 2-> Pragma off https://codeforces.net/contest/1737/submission/175313262

→ Reply

nor

2 years ago, # ^ |

It is probably due to some pragma changing how excess precision (in this case) is handled by the C++ compiler on Codeforces you are using (which is 32-bit, and there are a lot of weird things that happen with precision with the Codeforces 32-bit C++ compiler), but I might be wrong. It is probably because the compiler could be performing some optimizations in the code without pragmas but not the one which has the pragmas (the pragmas might be telling the optimizer to not optimize that code but do some other optimization instead), that it is able to keep extra precision in the code without the pragmas.

Ideally, your code should have gotten WA both times, since you are using std::sqrt on a long long which causes loss of precision. So there is nothing wrong with the pragmas (you were just relying on a compiler specific behavior). Submitting on a 64-bit compiler validates the above theory, the relevant submission can be found here: 175318712.

→ Reply

mukulrawat_03

2 years ago, # ^ |

Thanks

→ Reply

ScorpioDagger

2 years ago, # |

Very informative elaborating what is happening underneath the hood. I reached this blog from google. It helped me get this 552D - Vanya and Triangles from 1600 ms to 1270 ms using a brute-force approach.

→ Reply