MaxBuzz's blog

By MaxBuzz, 14 years ago, In English
The following code shows strange timings when compiled with GCC at different optimization levels:
  • gcc source.cpp -> 0.440 s
  • gcc -O2 source.cpp -> 2.750 s (-O, -O1 and -O2 give the same time)
  • gcc -Os source.cpp -> 0.223 s
For N=500, it is as follows:
  • gcc source.cpp -> 3.931 s
  • gcc -Os source.cpp -> 2.704 s
  • gcc -O2 source.cpp -> 42.142 s
The setup is GCC 4.4.4 on 64-bit Gentoo Linux.

Somehow optimizing for speed significantly slows the code down, while optimizing for size speeds it up :-)

Could anybody compile and test the code on their machines? Or perhaps explain why this happens?

14 years ago
My results are as expected:

  • gcc source.cpp -> 0.545s
  • gcc -O2 source.cpp -> 0.421s
  • gcc -Os source.cpp -> 0.466s
For N=500, it is as follows:
  • gcc source.cpp -> 5.758s
  • gcc -Os source.cpp -> 5.138s
  • gcc -O2 source.cpp -> 4.949s

gcc version 4.2.1 (Apple Inc. build 5666) (dot 3)
Target platform appears to be "x86_64", but the CPU itself is 32-bit.
14 years ago
Here are my results for N=200 (gcc 4.4.3, 32-bit Ubuntu):

g++  0.899s
g++ -O2 4.733s
g++ -Os 0.413s
g++ -O2 -fno-tree-ter 0.390s

One might conclude that the -ftree-ter optimization is broken. However, it seems to be enabled at -Os as well. In fact, on my system the only optimization that differs between -O2 and -Os is -finline-functions. I tried turning it on, but it had no effect.

Here's the relevant part of the man page:
-ftree-ter
Perform temporary expression replacement during the SSA->normal phase.  Single
use/single def temporaries are replaced at their use location with their
defining expression.  This results in non-GIMPLE code, but gives the expanders
much more complex trees to work on resulting in better RTL generation.  This is
enabled by default at -O and higher.
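The man-page wording can be made concrete with a source-level analogy. This is only an analogy: GCC applies TER to its internal SSA form when leaving SSA, not to the C++ source. The function names below are hypothetical and chosen for illustration.

```cpp
// Conceptually, before TER: `t` is a temporary with a single definition
// and a single use, kept as a separate statement.
int before_ter(int a, int b, int c) {
    int t = a * b;  // the only definition of t
    return t + c;   // the only use of t
}

// Conceptually, after TER: the defining expression `a * b` is substituted
// at its single use, so the RTL expander sees one larger tree, `a * b + c`.
int after_ter(int a, int b, int c) {
    return a * b + c;
}
```

Both functions compute the same value (17 for a=3, b=4, c=5). The pass changes only the shape of the trees handed to the expander, not what the program computes.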
14 years ago
In reply to adamax.

This is probably the key. It seems that this causes the strings to be copied before they are compared. As you can see, the slowdown with plain -O2 does not look like a constant factor but an asymptotic one. I will check this when I get home.

[Update] What I said about asymptotics was nonsense.
  • 14 years ago
    I disassembled the code of string::operator==. It turns out that with -O2 it uses the assembly instruction repz cmpsb, while in the other cases it calls the C library function memcmp. I found a description of this issue here. Quote:
    "in the -O0 case, GCC relies on the implementation
    of memcmp supplied with the C library. In the -O2 case, GCC instead uses its built-in implementation of memcmp. The built-in function uses the special IA-32 instruction repz cmpsb, which is known to be slow on modern hardware."
    Apparently, switching off builtins (-fno-builtin) should also fix the issue.

    And here is the Bugzilla link.
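    One way to experiment with this is the sketch below. The original benchmark source is not reproduced in this thread, so everything here is hypothetical:

    - `bytewise_equal` mimics the semantics of a `repz cmpsb` loop (one byte per step) in plain C++, without inline asm.
    - `strings_equal` roughly shows what `std::string` equality reduces to: compare sizes, then compare bytes with `std::memcmp`.
    - `time_equality` times many `operator==` calls.

    It needs a C++11 compiler, which is newer than the GCC 4.x versions in this thread. Whether GCC expands `memcmp` inline or calls the C library depends on the GCC version, the target and the flags, so any timings are your own measurements.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Byte-at-a-time comparison: semantically what a `repz cmpsb` loop does
// (one byte per iteration, stop at the first mismatch). Plain C++, no asm.
bool bytewise_equal(const char* a, const char* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return false;
    }
    return true;
}

// Roughly what std::string equality boils down to: lengths first, then bytes.
// Whether this memcmp is expanded inline or calls the C library depends on
// the compiler version, the target and flags such as -fno-builtin.
bool strings_equal(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::memcmp(a.data(), b.data(), a.size()) == 0;
}

// Hypothetical micro-benchmark: compares every ordered pair in `words`
// `rounds` times with std::string::operator== and returns elapsed seconds.
// `matches` receives the number of equal pairs, so the result is observable.
double time_equality(const std::vector<std::string>& words, int rounds,
                     long long& matches) {
    matches = 0;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < rounds; ++r) {
        for (const auto& x : words) {
            for (const auto& y : words) {
                if (x == y) ++matches;
            }
        }
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

    To try it:

    1. Fill `words` with a few hundred equal-length strings that share a long common prefix, so that each comparison has to scan many bytes.
    2. Call `time_equality` from `main`.
    3. Compare builds made with `g++ -O2` and `g++ -O2 -fno-builtin`.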