How can executing more instructions speed up execution?
When I run the following function, I get somewhat unexpected results.
On my machine, the code below consistently takes about 6 seconds to run. However, if I uncomment the ";dec [variable + 24]" line, so that more code is executed, it takes only about 4.5 seconds to run. Why?
.DATA
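variable dq 0 dup(4)    ; as posted; MASM's DUP syntax is "count DUP (value)", so "4 dup(0)" (four zeroed qwords) was probably intended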
.CODE
runAssemblyCode PROC
mov rax, 2330 * 1000 * 1000
start:
dec [variable]
dec [variable + 8]
dec [variable + 16]
;dec [variable + 24]
dec rax
jnz start
ret
runAssemblyCode ENDP
END
I have noticed that there are similar questions already on Stack Overflow, but their code samples are not as simple as this and I couldn't find any succinct answers to this question.
I have tried padding the code with nop instructions to see if it is an alignment problem, and also set the affinity to a single processor. Neither made any difference.
The simple answer is because modern CPUs are extremely complex. There is a lot going on under the hood that appears unpredictable or random to the observer.
Inserting that extra instruction might cause it to schedule instructions differently, which, in a tight loop like this, might make a difference. But that's just a guess.
As far as I can see, it touches the same cache line as the previous instruction, so it doesn't seem to be a kind of prefetching. I can't really think of a logical explanation, but again, the CPU makes use of a lot of undocumented heuristics and guesses to execute code as fast as possible, and sometimes, that means weird corner cases where they fail, and the code becomes slower than you'd expect.
Have you tested this on different CPU models? Would be interesting to see if this is just on your specific CPU, or if other x86 CPUs exhibit the same thing.
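Whether the four decrements really share one cache line depends on where the assembler places variable, which the thread does not say. The following Python sketch only illustrates the address arithmetic. The 64-byte line size is an assumption (typical for x86, not stated in the thread), and both base addresses are hypothetical values chosen for illustration.

```python
CACHE_LINE_BYTES = 64  # assumption: common x86 line size, not given in the thread


def cache_lines_touched(base_address, offsets=(0, 8, 16, 24),
                        access_size=8, line_size=CACHE_LINE_BYTES):
    """Return the sorted cache-line indices covered by qword accesses
    at base_address + offset for each offset."""
    lines = set()
    for offset in offsets:
        first_byte = base_address + offset
        last_byte = first_byte + access_size - 1
        # An access may straddle two lines if it is not naturally aligned.
        lines.update(range(first_byte // line_size, last_byte // line_size + 1))
    return sorted(lines)


# Hypothetical base addresses, not taken from the question.
print(cache_lines_touched(0x1000))  # base on a line boundary -> [64]
print(cache_lines_touched(0x1030))  # base 48 bytes into a line -> [64, 65]
```

In both hypothetical placements the access at offset 24 lands in the same line as the access at offset 16, which matches the observation in this answer, but the four qwords as a group can still span two lines.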
bob.s
ted.c
gcc -O2 ted.c bob.s -o ted
This was with the extra instruction:
I don't see a difference. Maybe you can correct my code, or others can try it on their systems to see what they get.
That is an extremely painful instruction, and if you are doing something other than byte-based memory decrements, the accesses are unaligned, which is going to be painful for the memory system. So this routine should be sensitive to cache lines as well as the number of cores, etc.
Timings, all on 64-bit Ubuntu (10.04 or 10.10, possibly an 11.04 in there):
AMD Phenom 9950 quad-core processor: about 13 seconds with or without the extra instruction.
Intel(R) Core(TM)2 CPU 6300: about 9-10 seconds with or without the extra instruction.
Intel(R) Xeon(TM) CPU (two-processor machine): about 13 seconds with or without the extra instruction.
Intel(R) Core(TM)2 Duo CPU T7500: 8 seconds with or without.
Intel(R) Xeon(R) CPU X5450 (8 core): 6 seconds with or without the extra instruction.
Intel(R) Xeon(R) CPU E5405 (8 core): 9 seconds with or without.
Intel(R) Xeon(R) CPU E5440 (8 core): 6 seconds with or without.
Intel(R) Xeon(TM) CPU (single core): 15 seconds with or without the extra instruction.
What is the speed of the DDR/DRAM in your system? What kind of processor are you running (cat /proc/cpuinfo if on Linux)?
It's not that bad. On average, the loop as posted (6 seconds for 2330000000 iterations) takes about 2.6 ns per iteration, while the version with the extra dec (4.5 seconds) takes about 1.9 ns. Assuming a 2 GHz CPU, which has a clock period of 0.5 ns, the difference per iteration is about
(2.6 - 1.9) / 0.5 = 1.4 clock cycles
That is nothing surprising. The time difference becomes so noticeable, though, because of the number of iterations you requested. Even a single clock cycle per iteration adds up to
0.5 ns * 2330000000 ≈ 1.2 seconds
That is in the same range as the 1.5-second difference you observed.
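The per-iteration arithmetic in this answer can be reproduced with a short Python sketch. The iteration count and the 6 s and 4.5 s timings come from the question. The 2 GHz clock is this answer's assumption, not a measured value, and the function names are made up for illustration. The sketch uses the unrounded timings, so its results differ slightly from the rounded figures.

```python
ITERATIONS = 2330 * 1000 * 1000   # loop count from the question
ASSUMED_CLOCK_HZ = 2.0e9          # assumption from this answer, not measured


def ns_per_iteration(total_seconds, iterations=ITERATIONS):
    """Average wall-clock time of one loop iteration, in nanoseconds."""
    return total_seconds / iterations * 1e9


def cycles_per_iteration(total_seconds, clock_hz=ASSUMED_CLOCK_HZ,
                         iterations=ITERATIONS):
    """Average clock cycles per iteration at the assumed clock rate."""
    return total_seconds * clock_hz / iterations


three_dec_seconds = 6.0   # timing reported without the extra dec
four_dec_seconds = 4.5    # timing reported with the extra dec

print(f"three dec: {ns_per_iteration(three_dec_seconds):.2f} ns/iteration")
print(f"four dec:  {ns_per_iteration(four_dec_seconds):.2f} ns/iteration")

cycle_gap = (cycles_per_iteration(three_dec_seconds)
             - cycles_per_iteration(four_dec_seconds))
print(f"difference: {cycle_gap:.2f} cycles/iteration at the assumed clock")

# Total time accounted for by exactly one cycle per iteration.
print(f"one cycle per iteration: {ITERATIONS / ASSUMED_CLOCK_HZ:.3f} s")
```

With a different clock rate the cycle count scales proportionally, so the conclusion (a gap on the order of one cycle per iteration) holds only under the 2 GHz assumption.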