Why does declaring variables as volatile speed up code execution?

Posted 2024-12-06 19:24:38

Any ideas? I'm using the GCC cross-compiler for a PPC750. I'm doing a simple multiplication of two floating-point numbers in a loop and timing it. I declared the variables volatile to make sure nothing important was optimized out, and the code sped up!

I've inspected the assembly instructions for both cases and, sure enough, the compiler generated many more instructions to do the same basic job in the non-volatile case. Execution time for 10,000,000 iterations dropped from 800ms to 300ms!

assembly for volatile case:

0x10eeec  stwu  r1,-32(r1)
0x10eef0  lis   r9,0x1d # 29
0x10eef4  lis   r11,0x4080 # 16512
0x10eef8  lfs   fr0,-18944(r9)
0x10eefc  li    r0,0x0 # 0
0x10ef00  lis   r9,0x98 # 152
0x10ef04  stfs  fr0,8(r1)
0x10ef08  mtspr CTR,r9
0x10ef0c  stw   r11,12(r1)
0x10ef10  stw   r0,16(r1)
0x10ef14  ori   r9,r9,0x9680
0x10ef18  mtspr CTR,r9
0x10ef1c  lfs   fr0,8(r1)
0x10ef20  lfs   fr13,12(r1)
0x10ef24  fmuls fr0,fr0,fr13
0x10ef28  stfs  fr0,16(r1)
0x10ef2c  bc    0x10,0, 0x10ef1c # 0x0010ef1c
0x10ef30  addi  r1,r1,0x20 # 32

assembly for non-volatile case:

0x10ef04  stwu        r1,-48(r1)
0x10ef08  stw         r31,44(r1)
0x10ef0c  or          r31,r1,r1
0x10ef10  lis         r9,0x1d # 29
0x10ef14  lfs         fr0,-18832(r9)
0x10ef18  stfs        fr0,12(r31)
0x10ef1c  lis         r0,0x4080 # 16512
0x10ef20  stw         r0,16(r31)
0x10ef24  li          r0,0x0 # 0
0x10ef28  stw         r0,20(r31)
0x10ef2c  li          r0,0x0 # 0
0x10ef30  stw         r0,8(r31)
0x10ef34  lwz         r0,8(r31)
0x10ef38  lis         r9,0x98 # 152
0x10ef3c  ori         r9,r9,0x967f
0x10ef40  cmpl        crf0,0,r0,r9
0x10ef44  bc          0x4,1, 0x10ef4c # 0x0010ef4c
0x10ef48  b           0x10ef6c # 0x0010ef6c
0x10ef4c  lfs         fr0,12(r31)
0x10ef50  lfs         fr13,16(r31)
0x10ef54  fmuls       fr0,fr0,fr13
0x10ef58  stfs        fr0,20(r31)
0x10ef5c  lwz         r9,8(r31)
0x10ef60  addi        r0,r9,0x1 # 1
0x10ef64  stw         r0,8(r31)
0x10ef68  b           0x10ef34 # 0x0010ef34
0x10ef6c  lwz         r11,0(r1)
0x10ef70  lwz         r31,-4(r11)
0x10ef74  or          r1,r11,r11
0x10ef78  blr

If I read this correctly, it's loading the values from memory during every iteration in both cases, but it seems to have generated a lot more instructions to do so in the non-volatile case.

Here's the source:

void floatTest()
{
    unsigned long i;
    volatile double d1 = 500.234, d2 = 4.000001, d3=0;
    for(i=0; i<10000000; i++)
        d3 = d1*d2;
}
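
One way to check whether volatile itself makes the difference is to build both variants in the same file, so they are compiled with exactly the same flags. The following is only a sketch under assumptions: it presumes a hosted C environment where `clock()` is available, and the names `ITERATIONS`, `float_test_volatile`, `float_test_plain` and `elapsed_ms` are made up for illustration and do not come from the original post.

```c
#include <time.h>

#define ITERATIONS 10000000UL

/* Variant from the question: volatile forces a load of d1 and d2 and a
   store to d3 on every iteration, at any optimization level. */
double float_test_volatile(void)
{
    unsigned long i;
    volatile double d1 = 500.234, d2 = 4.000001, d3 = 0;
    for (i = 0; i < ITERATIONS; i++)
        d3 = d1 * d2;
    return d3;
}

/* Same loop without volatile. With optimization enabled the compiler is
   free to hoist the multiply out of the loop or drop the loop entirely. */
double float_test_plain(void)
{
    unsigned long i;
    double d1 = 500.234, d2 = 4.000001, d3 = 0;
    for (i = 0; i < ITERATIONS; i++)
        d3 = d1 * d2;
    return d3;
}

/* Hypothetical helper: runs fn once, stores its return value in *result,
   and returns the CPU time it took in milliseconds. */
double elapsed_ms(double (*fn)(void), double *result)
{
    clock_t start = clock();
    *result = fn();
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `elapsed_ms(float_test_volatile, &r)` and `elapsed_ms(float_test_plain, &r)` from the same binary keeps the optimization level identical for both measurements. With optimization turned on, the plain variant may be reduced to almost nothing, so its time then says little about the cost of the multiply.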

毅然前行 2024-12-13 19:24:38

Are you sure you didn't also change optimization settings?

The original looks un-optimized - here's the looping part:

0x10ef34  lwz         r0,8(r31)       //Put 'i' in r0.
0x10ef38  lis         r9,0x98 # 152   //Put upper 16 bits of 9,999,999 (0x98967f) in r9
0x10ef3c  ori         r9,r9,0x967f    //OR in the lower 16 bits of 9,999,999
0x10ef40  cmpl        crf0,0,r0,r9    //compare r0 to r9 (unsigned)

0x10ef44  bc          0x4,1, 0x10ef4c //branch to loop body if r0<=r9
0x10ef48  b           0x10ef6c        //else branch to end

0x10ef4c  lfs         fr0,12(r31)     //load d1
0x10ef50  lfs         fr13,16(r31)    //load d2
0x10ef54  fmuls       fr0,fr0,fr13    //multiply
0x10ef58  stfs        fr0,20(r31)     //save d3

0x10ef5c  lwz         r9,8(r31)       //load i into r9
0x10ef60  addi        r0,r9,0x1       //add 1
0x10ef64  stw         r0,8(r31)       //save i

0x10ef68  b           0x10ef34        //go back to top, must reload r9

The volatile version looks quite optimized - it rearranges instructions, and uses the special purpose counter register instead of storing i on the stack:

0x10ef00  lis   r9,0x98 # 152      //upper 16 bits of 10,000,000 (0x989680)
//.. 4 initialization instructions here ..
0x10ef14  ori   r9,r9,0x9680       //lower 16 bits of 10,000,000
0x10ef18  mtspr CTR,r9             // store r9 in Special Purpose CTR register
0x10ef1c  lfs   fr0,8(r1)          // load d1
0x10ef20  lfs   fr13,12(r1)        // load d2
0x10ef24  fmuls fr0,fr0,fr13       // multiply
0x10ef28  stfs  fr0,16(r1)         // store result
0x10ef2c  bc    0x10,0, 0x10ef1c   // decrement counter and branch if not 0.

The CTR optimization reduces the loop to 5 instructions, instead of the 14 in the original code. I don't see any reason 'volatile' by itself would enable that optimization.
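
To make the difference between the two listings concrete, here is a rough C rendering of the two loop shapes. It is a sketch only, not the code the compiler actually saw: the function names `loop_count_up` and `loop_count_down` are hypothetical, and the counter in the first function is marked volatile purely to mimic the load/compare/increment/store pattern of the unoptimized build.

```c
/* Shape of the unoptimized loop: i lives in memory, so every iteration
   reloads it, compares it against the bound, increments it and stores
   it back. (volatile on i is only here to imitate that behavior.) */
double loop_count_up(unsigned long n)
{
    volatile double d1 = 500.234, d2 = 4.000001, d3 = 0;
    volatile unsigned long i;
    for (i = 0; i < n; i++)
        d3 = d1 * d2;
    return d3;
}

/* Shape of the optimized loop: the counter is only ever decremented and
   tested against zero, which a PowerPC compiler can keep in CTR and fold
   into a single decrement-and-branch instruction at the loop bottom. */
double loop_count_down(unsigned long n)
{
    volatile double d1 = 500.234, d2 = 4.000001, d3 = 0;
    unsigned long remaining = n;
    if (remaining == 0)
        return d3;
    do {
        d3 = d1 * d2;           /* two loads, one multiply, one store */
    } while (--remaining != 0); /* decrement and branch if not zero */
    return d3;
}
```

Both functions perform the same volatile loads and stores of d1, d2 and d3; only the loop bookkeeping differs, which is consistent with the speedup coming from the optimization level rather than from volatile.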
