Will gcc automatically "unroll" if-statements?

Posted on 2024-10-17 10:49:48

Say I have a loop that looks like this:

for(int i = 0; i < 10000; i++) {
    /* Do something computationally expensive */
    if (i < 200 && !(i%20)) {
        /* Do something else */
    }
}

wherein some trivial task gets stuck behind an if-statement that only runs a handful of times.
I've always heard that "if-statements in loops are slow!" So, in the hopes of (marginally) increased performance, I split the loops apart into:

for(int i = 0; i < 200; i++) {
    /* Do something computationally expensive */
    if (!(i%20)) {
        /* Do something else */
    }
}

for(int i = 200; i < 10000; i++) {
    /* Do something computationally expensive */
}

Will gcc (with the appropriate flags, like -O3) automatically break the one loop into two, or does it only unroll to decrease the number of iterations?

Comments (1)

2024-10-24 10:49:48

Why not just disassemble the program and see for yourself? But here we go. This is the test program:

#include <stdio.h>

int main(void) {
    int sum = 0;
    int i;
    for(i = 0; i < 10000; i++) {
        if (i < 200 && !(i%20)) {
            sum += 0xC0DE;
        }
        sum += 0xCAFE;
    }
    printf("%d\n", sum);
    return 0;
}

and this is the interesting part of the disassembled code, compiled with gcc 4.3.3 and -O3:

0x08048404 <main+20>:   xor    ebx,ebx
0x08048406 <main+22>:   push   ecx
0x08048407 <main+23>:   xor    ecx,ecx
0x08048409 <main+25>:   sub    esp,0xc
0x0804840c <main+28>:   lea    esi,[esi+eiz*1+0x0]
0x08048410 <main+32>:   cmp    ecx,0xc7
0x08048416 <main+38>:   jg     0x8048436 <main+70>
0x08048418 <main+40>:   mov    eax,ecx
0x0804841a <main+42>:   imul   esi
0x0804841c <main+44>:   mov    eax,ecx
0x0804841e <main+46>:   sar    eax,0x1f
0x08048421 <main+49>:   sar    edx,0x3
0x08048424 <main+52>:   sub    edx,eax
0x08048426 <main+54>:   lea    edx,[edx+edx*4]
0x08048429 <main+57>:   shl    edx,0x2
0x0804842c <main+60>:   cmp    ecx,edx
0x0804842e <main+62>:   jne    0x8048436 <main+70>
0x08048430 <main+64>:   add    ebx,0xc0de
0x08048436 <main+70>:   add    ecx,0x1
0x08048439 <main+73>:   add    ebx,0xcafe
0x0804843f <main+79>:   cmp    ecx,0x2710
0x08048445 <main+85>:   jne    0x8048410 <main+32>
0x08048447 <main+87>:   mov    DWORD PTR [esp+0x8],ebx
0x0804844b <main+91>:   mov    DWORD PTR [esp+0x4],0x8048530
0x08048453 <main+99>:   mov    DWORD PTR [esp],0x1
0x0804845a <main+106>:  call   0x8048308 <__printf_chk@plt>

So, as we can see, for this particular example the answer is no: gcc does not split the loop. There is only one loop, starting at main+32 and ending at main+85. If you have trouble reading the assembly: ecx = i and ebx = sum.

Your mileage may vary, though. Who knows which heuristics apply in this particular case, so you'll have to compile the code you actually have in mind and see how longer or more complicated computations influence the optimizer.

That said, on any modern CPU the branch predictor will do well on such simple code, so you won't see much of a performance loss in either case. What does a handful of mispredictions cost if your computation-intensive code needs billions of cycles?
