YaoRaoLov · 2013-10-19 20:36 · accepted

Why does GCC generate 15-20% faster code if I optimize for size instead of speed?

I first noticed in 2009 that GCC (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

I have managed to create (rather silly) code that shows this surprising behavior and is sufficiently small to be posted here.

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}

If I compile it with -Os, it takes 0.38 s to execute this program, and 0.44 s if it is compiled with -O2 or -O3. These times are obtained consistently and with practically no noise (gcc 4.7.2, x86_64 GNU/Linux, Intel Core i5-3320M).
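(If you would rather time the loop from inside the program than with an external stopwatch, here is a minimal std::chrono variant of the same benchmark. This is a sketch of my own, not the setup used for the numbers above.)

// timing_sketch.cpp -- illustrative variant of the benchmark that
// measures work() with std::chrono instead of timing the whole process.
#include <chrono>
#include <cstdio>

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        sum += add(x, y);
    }
    return sum;
}

int main(int, char* argv[]) {
    auto t0 = std::chrono::steady_clock::now();
    int result = work(*argv[1], *argv[2]);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%.3f s\n", std::chrono::duration<double>(t1 - t0).count());
    return result;
}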

(Update: I have moved all the assembly code to GitHub: the listings made the post bloated and apparently added very little value to the question, as the -fno-align-* flags have the same effect.)

Here is the generated assembly with -Os and -O2.

Unfortunately, my understanding of assembly is very limited, so I have no idea whether what I did next was correct: I grabbed the assembly for -O2 and merged all its differences into the assembly for -Os except the .p2align lines, result here. This code still runs in 0.38s and the only difference is the .p2align stuff.

If I guess correctly, this is padding for code alignment. According to Why does GCC pad functions with NOPs? it is done in the hope that the code will run faster, but apparently this optimization backfired in my case.
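(For reference: .p2align N aligns the next instruction to a 2^N byte boundary by inserting NOPs. A typical pair that gcc emits in front of a loop looks like the following; the exact operands vary by gcc version and target, so treat this as an illustration rather than my exact output.)

    .p2align 4,,10    # align to 2^4 = 16 bytes, but only if at most 10 padding bytes are needed
    .p2align 3        # otherwise fall back to aligning to 2^3 = 8 bytes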

Is it the padding that is the culprit in this case? Why and how?

The noise it introduces pretty much makes timing micro-optimizations impossible.

How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?


UPDATE:

Following Pascal Cuoq's answer I tinkered a little bit with the alignments. By passing -O2 -fno-align-functions -fno-align-loops to gcc, all .p2align directives are gone from the assembly and the generated executable runs in 0.38s. According to the gcc documentation:

-Os enables all -O2 optimizations [but] -Os disables the following optimization flags:

  -falign-functions  -falign-jumps  -falign-loops
  -falign-labels  -freorder-blocks  -freorder-blocks-and-partition
  -fprefetch-loop-arrays

So, it pretty much seems like a (mis)alignment issue.
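(A quick sanity check, which is my own addition and not part of the original experiment: print the function's address and see how it falls relative to the alignment boundaries, without reading the disassembly.)

// align_check.cpp -- illustrative: where did the linker put add()?
#include <cstdint>
#include <cstdio>

__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

int main() {
    // Converting a function pointer to an integer is implementation-
    // defined, but behaves as expected on x86_64 GNU/Linux.
    auto a = reinterpret_cast<std::uintptr_t>(&add);
    std::printf("add() at %#lx (mod 16 = %lu, mod 256 = %lu)\n",
                static_cast<unsigned long>(a),
                static_cast<unsigned long>(a % 16),
                static_cast<unsigned long>(a % 256));
    return 0;
}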

I am still skeptical about -march=native as suggested in Marat Dukhan's answer. I am not convinced that it isn't just interfering with this (mis)alignment issue; it has absolutely no effect on my machine. (Nevertheless, I upvoted his answer.)


UPDATE 2:

We can take -Os out of the picture. The following times are obtained by compiling with

  • -O2 -fno-omit-frame-pointer: 0.37s

  • -O2 -fno-align-functions -fno-align-loops: 0.37s

  • -S -O2, then manually moving the assembly of add() after work(): 0.37s

  • -O2: 0.44s

It looks to me like the distance of add() from the call site matters a lot. I have tried perf, but the output of perf stat and perf report makes very little sense to me. However, I could get one consistent result out of it:

-O2:

 602,312,864 stalled-cycles-frontend   #    0.00% frontend cycles idle
       3,318 cache-misses
 0.432703993 seconds time elapsed
 [...]
 81.23%  a.out  a.out              [.] work(int, int)
 18.50%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦       return x + y;
100.00 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   }
       ¦   ? retq
[...]
       ¦            int z = add(x, y);
  1.93 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 79.79 ¦      add    %eax,%ebx

For -fno-align-*:

 604,072,552 stalled-cycles-frontend   #    0.00% frontend cycles idle
       9,508 cache-misses
 0.375681928 seconds time elapsed
 [...]
 82.58%  a.out  a.out              [.] work(int, int)
 16.83%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦       return x + y;
 51.59 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   }
[...]
       ¦    __attribute__((noinline))
       ¦    static int work(int xval, int yval) {
       ¦        int sum(0);
       ¦        for (int i=0; i<LOOP_BOUND; ++i) {
       ¦            int x(xval+sum);
  8.20 ¦      lea    0x0(%r13,%rbx,1),%edi
       ¦            int y(yval+sum);
       ¦            int z = add(x, y);
 35.34 ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 39.48 ¦      add    %eax,%ebx
       ¦    }

For -fno-omit-frame-pointer:

 404,625,639 stalled-cycles-frontend   #    0.00% frontend cycles idle
      10,514 cache-misses
 0.375445137 seconds time elapsed
 [...]
 75.35%  a.out  a.out              [.] add(int const&, int const&) [clone .isra.0]
 24.46%  a.out  a.out              [.] work(int, int)
 [...]
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
 18.67 ¦     push   %rbp
       ¦       return x + y;
 18.49 ¦     lea    (%rdi,%rsi,1),%eax
       ¦   const int LOOP_BOUND = 200000000;
       ¦
       ¦   __attribute__((noinline))
       ¦   static int add(const int& x, const int& y) {
       ¦     mov    %rsp,%rbp
       ¦       return x + y;
       ¦   }
 12.71 ¦     pop    %rbp
       ¦   ? retq
 [...]
       ¦            int z = add(x, y);
       ¦    ? callq  add(int const&, int const&) [clone .isra.0]
       ¦            sum += z;
 29.83 ¦      add    %eax,%ebx

It looks like we are stalling on the call to add() in the slow case.

I have examined everything that perf -e can spit out on my machine, not just the stats given above.

For the same executable, the stalled-cycles-frontend shows linear correlation with the execution time; I did not notice anything else that would correlate so clearly. (Comparing stalled-cycles-frontend for different executables doesn't make sense to me.)

I included the cache misses as they came up in the first comment. I examined all the cache misses that can be measured on my machine by perf, not just the ones given above. The cache misses are very, very noisy and show little to no correlation with the execution times.

Reposted from: https://stackoverflow.com/questions/19470873/why-does-gcc-generate-15-20-faster-code-if-i-optimize-for-size-instead-of-speed


6 answers

  • 胖鸭 2013-10-24 15:32

    My colleague helped me find a plausible answer to my question. He noticed the importance of the 256 byte boundary. He is not registered here and encouraged me to post the answer myself (and take all the fame).


    Short answer:

    Is it the padding that is the culprit in this case? Why and how?

    It all boils down to alignment. Alignment can have a significant impact on performance; that is why we have the -falign-* flags in the first place.

    I have submitted a (bogus?) bug report to the gcc developers. It turns out that the default behavior is "we align loops to 8 byte by default but try to align it to 16 byte if we don't need to fill in over 10 bytes." Apparently, this default is not the best choice in this particular case and on my machine. Clang 3.4 (trunk) with -O3 does the appropriate alignment and the generated code does not show this weird behavior.

    Of course, if an inappropriate alignment is done, it makes things worse. An unnecessary / bad alignment just eats up bytes for no reason and potentially increases cache misses, etc.

    The noise it introduces pretty much makes timing micro-optimizations impossible.

    How can I make sure that such accidental lucky / unlucky alignments are not interfering when I do micro-optimizations (unrelated to stack alignment) on C or C++ source code?

    Simply by telling gcc to do the right alignment:

    g++ -O2 -falign-functions=16 -falign-loops=16


    Long answer:

    The code will run slower if:

    • an XX byte boundary cuts add() in the middle (XX being machine dependent; see the boundary-check sketch after this list);

    • the call to add() has to jump over an XX byte boundary and the target is not aligned;

    • add() is not aligned;

    • the loop is not aligned.
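    As referenced in the first bullet, here is a tiny helper for checking that condition; feed it a function's start address and size as reported by objdump or nm. (The helper is my own illustration, not part of the original analysis.)

    #include <cstdint>

    // True if the byte range [start, start + size) crosses an n-byte
    // boundary, i.e. an n-byte boundary cuts the function in the middle.
    bool crosses_boundary(std::uintptr_t start, std::uintptr_t size,
                          std::uintptr_t n) {
        return (start / n) != ((start + size - 1) / n);
    }

    For the gcc-4.8.1 layout below, crosses_boundary(0x4004fd, 4, 256) returns true: the 4-byte add() straddles the boundary at 0x400500.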

    The first two are beautifully visible in the code and results that Marat Dukhan kindly posted. In this case, gcc-4.8.1 -Os (executes in 0.994 secs):

    00000000004004fd <_ZL3addRKiS0_.isra.0>:
      4004fd:       8d 04 37                lea    eax,[rdi+rsi*1]
      400500:       c3                      ret
    

    a 256 byte boundary cuts add() right in the middle and neither add() nor the loop is aligned. Surprise, surprise, this is the slowest case!

    In case gcc-4.7.3 -Os (executes in 0.822 secs), the 256 byte boundary only cuts into a cold section (but neither the loop, nor add() is cut):

    00000000004004fa <_ZL3addRKiS0_.isra.0>:
      4004fa:       8d 04 37                lea    eax,[rdi+rsi*1]
      4004fd:       c3                      ret
    
    [...]
    
      40051a:       e8 db ff ff ff          call   4004fa <_ZL3addRKiS0_.isra.0>
    

    Nothing is aligned, and the call to add() has to jump over the 256 byte boundary. This code is the second slowest.

    In case gcc-4.6.4 -Os (executes in 0.709 secs), although nothing is aligned, the call to add() doesn't have to jump over the 256 byte boundary and the target is exactly 32 bytes away:

      4004f2:       e8 db ff ff ff          call   4004d2 <_ZL3addRKiS0_.isra.0>
      4004f7:       01 c3                   add    ebx,eax
      4004f9:       ff cd                   dec    ebp
      4004fb:       75 ec                   jne    4004e9 <_ZL4workii+0x13>
    
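    For completeness, the 32 byte figure can be verified by decoding the call by hand: the rel32 displacement db ff ff ff is -0x25 (-37), taken relative to the next instruction at 0x4004f7, so the target is 0x4004f7 - 0x25 = 0x4004d2, which lies 0x4004f2 - 0x4004d2 = 0x20 = 32 bytes before the call instruction itself.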

    This is the fastest of all three. Why the 256 byte boundary is special on his machine, I will leave it up to him to figure out. I don't have such a processor.

    Now, on my machine I don't get this 256 byte boundary effect. Only function and loop alignment kick in on my machine. If I pass g++ -O2 -falign-functions=16 -falign-loops=16 then everything is back to normal: I always get the fastest case and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiple of 16; the code is not sensitive to that either.

    I first noticed in 2009 that gcc (at least on my projects and on my machines) has the tendency to generate noticeably faster code if I optimize for size (-Os) instead of speed (-O2 or -O3), and I have been wondering ever since why.

    A likely explanation is that I had hotspots which were sensitive to the alignment, just like the one in this example. By messing with the flags (passing -Os instead of -O2), those hotspots were aligned in a lucky way by accident and the code became faster. It had nothing to do with optimizing for size: it was by sheer accident that the hotspots got aligned better. From now on, I will check the effects of alignment on my projects.

    Oh, and one more thing. How can such hotspots arise, like the one shown in the example? How can the inlining of such a tiny function like add() fail?

    Consider this:

    // add.cpp
    int add(const int& x, const int& y) {
        return x + y;
    }
    

    and in a separate file:

    // main.cpp
    int add(const int& x, const int& y);
    
    const int LOOP_BOUND = 200000000;
    
    __attribute__((noinline))
    static int work(int xval, int yval) {
        int sum(0);
        for (int i=0; i<LOOP_BOUND; ++i) {
            int x(xval+sum);
            int y(yval+sum);
            int z = add(x, y);
            sum += z;
        }
        return sum;
    }
    
    int main(int , char* argv[]) {
        int result = work(*argv[1], *argv[2]);
        return result;
    }
    

    and compiled as: g++ -O2 add.cpp main.cpp.

          gcc won't inline add()!

    That's all: it's that easy to unintentionally create hotspots like the one in the OP. Of course it is partly my fault: gcc is an excellent compiler. If I compile the above as g++ -O2 -flto add.cpp main.cpp, that is, if I perform link time optimization, the code runs in 0.19s!
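    An alternative to LTO, offered as my own suggestion rather than something from the original answer, is to define such a tiny function in a header so that every translation unit sees the body and the compiler can inline it without -flto:

    // add.h -- hypothetical header; with the definition visible in every
    // translation unit, gcc can inline add() at -O2 without -flto.
    #ifndef ADD_H
    #define ADD_H

    inline int add(const int& x, const int& y) {
        return x + y;
    }

    #endif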

    (Inlining is artificially disabled in the OP; hence, the code in the OP was 2x slower.)

    This answer was accepted by the asker.
