Relative performance of x86 inc vs. add instruction

Posted 2024-11-06 21:03:36

Quick question, assuming beforehand

mov eax, 0

which is more efficient?

inc eax
inc eax

or

add eax, 2

Also, if the two incs are faster, do compilers (say, GCC) commonly (i.e. without aggressive optimization flags) optimize var += 2 into them?

PS: Don't bother to answer with a variation of "don't prematurely optimize", this is merely academic interest.

Comments (4)

小伙你站住 2024-11-13 21:03:36

Two inc instructions on the same register (or, more generally, two read-modify-write instructions) always form a dependency chain of at least two cycles. This assumes a one-clock latency for an inc, which has been the case since the 486. That means that if the surrounding instructions can't be interleaved with the two inc instructions to hide those latencies, the code will execute more slowly.

But no compiler will emit the instruction sequence you propose anyway (and mov eax,0 would be replaced by xor eax,eax; see "What is the purpose of XORing a register with itself?"). The sequence

mov eax,0
inc eax
inc eax

will be optimized to

mov eax,2

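The folding described above can be checked from C. The following is only a minimal sketch: the function names `two_incs`, `one_add`, and `check_equivalence` are made up for illustration. With optimization enabled, GCC and similar compilers would typically be expected to reduce both function bodies to a single constant load, but that depends on the compiler and version.

```c
#include <assert.h>

/* Mirrors the question: start from 0, then increment twice. */
int two_incs(void) {
    int x = 0;
    x++;
    x++;
    return x;
}

/* The same computation written as a single addition. */
int one_add(void) {
    int x = 0;
    x += 2;
    return x;
}

/* Both forms compute the same value, so an optimizer is free to
   replace either one with the constant 2. */
void check_equivalence(void) {
    assert(two_incs() == one_add());
}
```

Compiling this file with something like `gcc -O2 -S` and reading the generated assembly is one way to see what your own toolchain emits for each function.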
梦冥 2024-11-13 21:03:36

If you ever want to know the raw performance stats of x86 instructions, see Dr. Agner Fog's listings (volume 4, to be exact). As for the part about compilers, that's dependent on the compiler's code generator, and not something you should rely on too much.

On a side note: I find it funny/ironic that in a question about performance, you used MOV EAX,0 to zero a register instead of XOR EAX,EAX :P (and if MOV EAX,0 was done beforehand, the fastest variant would be to remove the incs and adds and just use MOV EAX,2).

陪你搞怪i 2024-11-13 21:03:36

From the Intel manual that you can find here, it looks like the ADD/SUB instructions are half a cycle cheaper on one particular architecture. But remember that Intel uses an out-of-order execution model for its (recent) processors. This primarily means that performance bottlenecks show up wherever the processor has to wait for data to come in (e.g. it ran out of things to do during the L1/L2/L3/RAM data fetch). So if your profiler tells you INC might be the problem, look at it from a data-throughput point of view instead of looking at raw cycle counts.

Instruction   Latency            Throughput         Execution Unit
CPUID         0F_3H    0F_2H     0F_3H    0F_2H     0F_2H

ADD/SUB       1        0.5       0.5      0.5       ALU
[...]
DEC/INC       1        1         0.5      0.5       ALU
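As a rough worked example of what the table implies for the question, the sketch below multiplies a per-instruction latency by the number of dependent instructions. The helper `chain_latency` and the macro names are hypothetical, the latency values are simply copied from the table above, and the model deliberately ignores everything an out-of-order core does to overlap other work.

```c
#include <assert.h>

/* Latencies in cycles, copied from the table above. */
#define ADD_LATENCY_0F_3H 1.0
#define ADD_LATENCY_0F_2H 0.5
#define INC_LATENCY_0F_3H 1.0
#define INC_LATENCY_0F_2H 1.0

/* Hypothetical helper: latency of n back-to-back instructions that each
   depend on the previous result. A naive model, not a simulator. */
double chain_latency(double latency_per_op, int n) {
    assert(n >= 0);
    return latency_per_op * n;
}
```

Under this simplified model, `chain_latency(INC_LATENCY_0F_2H, 2)` gives 2 cycles for the two incs, while `chain_latency(ADD_LATENCY_0F_2H, 1)` gives 0.5 cycles for the single add on the same CPUID column.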
凡尘雨 2024-11-13 21:03:36

For all practical purposes, it probably doesn't matter. But take into account that inc uses fewer bytes.
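To make the size difference concrete, here is a small sketch that writes out the register-form machine encodings as byte arrays and compares their lengths. The array and function names are made up for illustration. Note the assumption: the one-byte `inc eax` exists only in 32-bit mode, because in 64-bit mode the byte 0x40 is a REX prefix and `inc eax` needs the two-byte form.

```c
#include <assert.h>
#include <stddef.h>

/* 32-bit mode encodings (assumption: i386 code, not x86-64). */
static const unsigned char INC_EAX_32[] = { 0x40 };             /* inc eax    */
static const unsigned char ADD_EAX_2[]  = { 0x83, 0xC0, 0x02 }; /* add eax, 2 */

/* 64-bit mode: 0x40 is a REX prefix there, so inc eax is FF C0. */
static const unsigned char INC_EAX_64[] = { 0xFF, 0xC0 };       /* inc eax    */

/* Total code size in bytes of each variant from the question. */
size_t two_incs_size_32(void) { return 2 * sizeof INC_EAX_32; }
size_t two_incs_size_64(void) { return 2 * sizeof INC_EAX_64; }
size_t add_size(void)         { return sizeof ADD_EAX_2; }

/* In 32-bit mode, two incs are one byte shorter than the add. */
static_assert(2 * sizeof INC_EAX_32 < sizeof ADD_EAX_2,
              "two 32-bit incs should be smaller than add eax, 2");
```

So the size advantage holds for 32-bit code (2 bytes against 3) but reverses in 64-bit code (4 bytes against 3).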

Consider the following code:

int x = 0;
x += 2;

Without using any optimization flags, GCC compiles this code into:

80483ed:       c7 44 24 1c 00 00 00    movl   $0x0,0x1c(%esp)
80483f4:       00 
80483f5:       83 44 24 1c 02          addl   $0x2,0x1c(%esp)

Using -O1 and -O2, it becomes:

c7 44 24 08 02 00 00    movl   $0x2,0x8(%esp)

Funny, isn't it?
