Relative performance of the x86 inc and add instructions
Quick question. Assuming the following has been executed beforehand:
mov eax, 0
which is more efficient?
inc eax
inc eax
or
add eax, 2
Also, in case the two inc instructions are faster, do compilers (say, GCC) commonly (i.e. without aggressive optimization flags) optimize var += 2 into them?
PS: Don't bother answering with a variation of "don't prematurely optimize"; this is purely academic interest.
Comments (4)
Two inc instructions on the same register (or, more generally, two read-modify-write instructions) always form a dependency chain of at least two cycles. This assumes a one-clock latency for an inc, which has been the case since the 486. That means that if the surrounding instructions can't be interleaved with the two inc instructions to hide those latencies, the code will execute more slowly.
But no compiler will emit the instruction sequence you propose anyway (mov eax,0 will be replaced by xor eax,eax; see What is the purpose of XORing a register with itself?). With a known starting value, the whole sequence will be optimized into a single instruction that loads the constant directly.
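The dependency-chain argument can be checked empirically. The following is only a minimal sketch, assuming GCC or Clang on x86 with the default AT&T inline-assembly syntax; the names chain_incs, chain_add and time_chain are hypothetical and not from the answer. On other targets it falls back to plain C, where the comparison is no longer meaningful.

```c
#include <stdint.h>
#include <time.h>

/* Assumption: GNU-style inline assembly is only used on x86 targets. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define USE_X86_ASM 1
#else
#define USE_X86_ASM 0
#endif

/* Hypothetical helper: n iterations of two dependent inc instructions. */
uint32_t chain_incs(uint32_t n)
{
    uint32_t x = 0;
    for (uint32_t i = 0; i < n; i++) {
#if USE_X86_ASM
        /* The second inc needs the result of the first one. */
        __asm__ volatile("incl %0\n\tincl %0" : "+r"(x) : : "cc");
#else
        x += 1;
        x += 1;
#endif
    }
    return x;
}

/* Hypothetical helper: n iterations of a single add of 2. */
uint32_t chain_add(uint32_t n)
{
    uint32_t x = 0;
    for (uint32_t i = 0; i < n; i++) {
#if USE_X86_ASM
        __asm__ volatile("addl $2, %0" : "+r"(x) : : "cc");
#else
        x += 2;
#endif
    }
    return x;
}

/* Rough CPU time in seconds for one run; the result is stored in *out. */
double time_chain(uint32_t (*chain)(uint32_t), uint32_t n, uint32_t *out)
{
    clock_t begin = clock();
    *out = chain(n);
    return (double)(clock() - begin) / CLOCKS_PER_SEC;
}
```

Calling time_chain(chain_incs, n, &r) and time_chain(chain_add, n, &r) with a large n gives a rough comparison. Loop overhead, clock() resolution and the specific CPU all affect the numbers, so they are an indication rather than a measurement of instruction latency.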
If you ever want to know the raw performance numbers for x86 instructions, see Dr. Agner Fog's listings (volume 4, to be exact). As for the part about compilers, that depends on the compiler's code generator and is not something you should rely on too much.
On a side note: I find it funny/ironic that, in a question about performance, you used MOV EAX,0 to zero a register instead of XOR EAX,EAX :P (and if MOV EAX,0 was done beforehand, the fastest variant would be to remove the incs and adds and just use MOV EAX,2).
From the Intel manual that you can find here, it looks like the ADD/SUB instructions are half a cycle cheaper on one particular architecture. But remember that Intel uses an out-of-order execution model for its (recent) processors. This primarily means that performance bottlenecks show up wherever the processor has to wait for data to come in (e.g. it ran out of things to do during an L1/L2/L3/RAM data fetch). So if your profiler tells you INC might be the problem, look at it from a data-throughput point of view instead of looking at raw cycle counts.
For all practical purposes, it probably doesn't matter. But take into account that inc uses fewer bytes.
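The size claim can be made concrete by writing out the machine-code encodings. This is a small sketch: the byte values are the standard x86 encodings, while the array and function names are hypothetical. Note that the one-byte inc form exists only in 32-bit mode, because in 64-bit mode the byte 0x40 is a REX prefix.

```c
#include <stddef.h>

/* inc eax, short form: valid in 32-bit mode only. */
static const unsigned char INC_EAX_SHORT[] = { 0x40 };
/* inc eax, ModRM form: the form that must be used in 64-bit mode. */
static const unsigned char INC_EAX_MODRM[] = { 0xFF, 0xC0 };
/* add eax, 2 with a sign-extended 8-bit immediate. */
static const unsigned char ADD_EAX_2[] = { 0x83, 0xC0, 0x02 };

/* Hypothetical helper: code size in bytes of "inc eax; inc eax". */
size_t two_incs_size(int long_mode)
{
    size_t one = long_mode ? sizeof INC_EAX_MODRM : sizeof INC_EAX_SHORT;
    return 2 * one;
}

/* Hypothetical helper: code size in bytes of "add eax, 2". */
size_t add_size(void)
{
    return sizeof ADD_EAX_2;
}
```

Under these encodings, the two incs take 2 bytes against 3 for the add in 32-bit code, but 4 bytes against 3 in 64-bit code, so the size advantage only holds in 32-bit mode.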
Consider the following code:
Without using any optimization flags, GCC compiles this code into:
Using -O1 and -O2, it becomes:
Funny, isn't it?
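To try this kind of comparison yourself, a minimal sketch could look like the following. The function add_two is hypothetical and is not the code from the answer; it only gives the compiler a var += 2 to translate.

```c
/* Hypothetical test function: gives the compiler a "var += 2" to translate. */
int add_two(int var)
{
    var += 2;
    return var;
}
```

Saving this to a file and compiling it once with gcc -S -O0 and once with gcc -S -O2 writes the generated assembly to a .s file, where the instruction chosen for the addition can be inspected. The result depends on the GCC version and the target, so it is best treated as an observation about one setup rather than a general rule.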