为什么将变量声明为 volatile 会加快代码执行速度?
有什么想法吗?我在 PPC750 上使用 GCC 交叉编译器。在循环中对两个浮点数进行简单的乘法运算并计时。我将变量声明为易失性,以确保没有优化任何重要的内容,并且代码速度加快了!
我检查了这两种情况的汇编指令,果然,编译器生成了更多指令来在非易失性情况下执行相同的基本工作。 10,000,000 次迭代的执行时间从 800 毫秒下降到 300 毫秒!
易失性情况的汇编:
0x10eeec stwu r1,-32(r1)
0x10eef0 lis r9,0x1d # 29
0x10eef4 lis r11,0x4080 # 16512
0x10eef8 lfs fr0,-18944(r9)
0x10eefc li r0,0x0 # 0
0x10ef00 lis r9,0x98 # 152
0x10ef04 stfs fr0,8(r1)
0x10ef08 mtspr CTR,r9
0x10ef0c stw r11,12(r1)
0x10ef10 stw r0,16(r1)
0x10ef14 ori r9,r9,0x9680
0x10ef18 mtspr CTR,r9
0x10ef1c lfs fr0,8(r1)
0x10ef20 lfs fr13,12(r1)
0x10ef24 fmuls fr0,fr0,fr13
0x10ef28 stfs fr0,16(r1)
0x10ef2c bc 0x10,0, 0x10ef1c # 0x0010ef1c
0x10ef30 addi r1,r1,0x20 # 32
非易失性情况的汇编:
0x10ef04 stwu r1,-48(r1)
0x10ef08 stw r31,44(r1)
0x10ef0c or r31,r1,r1
0x10ef10 lis r9,0x1d # 29
0x10ef14 lfs fr0,-18832(r9)
0x10ef18 stfs fr0,12(r31)
0x10ef1c lis r0,0x4080 # 16512
0x10ef20 stw r0,16(r31)
0x10ef24 li r0,0x0 # 0
0x10ef28 stw r0,20(r31)
0x10ef2c li r0,0x0 # 0
0x10ef30 stw r0,8(r31)
0x10ef34 lwz r0,8(r31)
0x10ef38 lis r9,0x98 # 152
0x10ef3c ori r9,r9,0x967f
0x10ef40 cmpl crf0,0,r0,r9
0x10ef44 bc 0x4,1, 0x10ef4c # 0x0010ef4c
0x10ef48 b 0x10ef6c # 0x0010ef6c
0x10ef4c lfs fr0,12(r31)
0x10ef50 lfs fr13,16(r31)
0x10ef54 fmuls fr0,fr0,fr13
0x10ef58 stfs fr0,20(r31)
0x10ef5c lwz r9,8(r31)
0x10ef60 addi r0,r9,0x1 # 1
0x10ef64 stw r0,8(r31)
0x10ef68 b 0x10ef34 # 0x0010ef34
0x10ef6c lwz r11,0(r1)
0x10ef70 lwz r31,-4(r11)
0x10ef74 or r1,r11,r11
0x10ef78 blr
如果我正确地理解了这一点,它会在每次迭代期间从内存中加载值在两种情况下,但它似乎生成了更多的指令要做所以在非易失性的情况下。
这是来源:
void floatTest()
{
unsigned long i;
volatile double d1 = 500.234, d2 = 4.000001, d3=0;
for(i=0; i<10000000; i++)
d3 = d1*d2;
}
Any ideas? I'm using the GCC cross-compiler for a PPC750. Doing a simple multiply operation of two floating-point numbers in a loop and timing it. I declared the variables to be volatile to make sure nothing important was optimized out, and the code sped up!
I've inspected the assembly instructions for both cases and, sure enough, the compiler generated many more instructions to do the same basic job in the non-volatile case. Execution time for 10,000,000 iterations dropped from 800ms to 300ms!
assembly for volatile case:
0x10eeec stwu r1,-32(r1)
0x10eef0 lis r9,0x1d # 29
0x10eef4 lis r11,0x4080 # 16512
0x10eef8 lfs fr0,-18944(r9)
0x10eefc li r0,0x0 # 0
0x10ef00 lis r9,0x98 # 152
0x10ef04 stfs fr0,8(r1)
0x10ef08 mtspr CTR,r9
0x10ef0c stw r11,12(r1)
0x10ef10 stw r0,16(r1)
0x10ef14 ori r9,r9,0x9680
0x10ef18 mtspr CTR,r9
0x10ef1c lfs fr0,8(r1)
0x10ef20 lfs fr13,12(r1)
0x10ef24 fmuls fr0,fr0,fr13
0x10ef28 stfs fr0,16(r1)
0x10ef2c bc 0x10,0, 0x10ef1c # 0x0010ef1c
0x10ef30 addi r1,r1,0x20 # 32
asssembly for non-volatile case:
0x10ef04 stwu r1,-48(r1)
0x10ef08 stw r31,44(r1)
0x10ef0c or r31,r1,r1
0x10ef10 lis r9,0x1d # 29
0x10ef14 lfs fr0,-18832(r9)
0x10ef18 stfs fr0,12(r31)
0x10ef1c lis r0,0x4080 # 16512
0x10ef20 stw r0,16(r31)
0x10ef24 li r0,0x0 # 0
0x10ef28 stw r0,20(r31)
0x10ef2c li r0,0x0 # 0
0x10ef30 stw r0,8(r31)
0x10ef34 lwz r0,8(r31)
0x10ef38 lis r9,0x98 # 152
0x10ef3c ori r9,r9,0x967f
0x10ef40 cmpl crf0,0,r0,r9
0x10ef44 bc 0x4,1, 0x10ef4c # 0x0010ef4c
0x10ef48 b 0x10ef6c # 0x0010ef6c
0x10ef4c lfs fr0,12(r31)
0x10ef50 lfs fr13,16(r31)
0x10ef54 fmuls fr0,fr0,fr13
0x10ef58 stfs fr0,20(r31)
0x10ef5c lwz r9,8(r31)
0x10ef60 addi r0,r9,0x1 # 1
0x10ef64 stw r0,8(r31)
0x10ef68 b 0x10ef34 # 0x0010ef34
0x10ef6c lwz r11,0(r1)
0x10ef70 lwz r31,-4(r11)
0x10ef74 or r1,r11,r11
0x10ef78 blr
If I read this correctly, it's loading the values from memory during every iteration in both cases, but it seems to have generated a lot more instructions to do so in the non-volatile case.
Here's the source:
void floatTest()
{
unsigned long i;
volatile double d1 = 500.234, d2 = 4.000001, d3=0;
for(i=0; i<10000000; i++)
d3 = d1*d2;
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您确定您没有也更改优化设置吗?
原始版本看起来未优化 - 这是循环部分:
易失性版本看起来相当优化 - 它重新排列指令,并使用专用计数器寄存器而不是在堆栈上存储
i
:CTR 优化减少了循环到 5 条指令,而不是原始代码中的 14 条指令。我不认为“易失性”本身有任何理由可以实现这种优化。
Are you sure you didn't also change optimization settings?
The original looks un-optimized - here's the looping part:
The volatile version looks quite optimized - It rearanges instructions, and uses the special purpose counter register instead of storing
i
on the stack:The CTR optimization reduces the loop to 5 instructions, instead of the 14 in the original code. I don't see any reason 'volatile' by itself would enable that optimization.