Assigning zero vs. xor: is the second really faster?

Posted on 2024-12-09 04:31:18

A few years ago, someone showed me the following command to zero a variable:

xor i,i

He told me that this is faster than just assigning zero to it.
Is it true?
Do compilers do optimization to get the code to perform such a thing?

Comments (3)

像你 2024-12-16 04:31:18

You can try this yourself to see the answer:

  movl $0,%eax
  xor %eax,%eax

Assemble, then disassemble:

as xor.s -o xor.o
objdump -D xor.o

And you get:

   0:   b8 00 00 00 00          mov    $0x0,%eax
   5:   31 c0                   xor    %eax,%eax

The mov instruction for a 32-bit register is 2.5 times larger (5 bytes versus 2), takes longer to load from RAM, and consumes that much more cache space. Back in the day the load time alone was a killer. Today the memory cycle time and cache space could be argued to be less noticeable, but if your compiler and/or code does this too often you will see the loss of cache space and/or more evictions, and more slow system memory cycles.

In modern CPUs, larger code size can also slow down the decoders, possibly preventing them from decoding their maximum number of x86 instructions per cycle (e.g. up to 4 instructions in a 16-byte block for some CPUs).

There are also performance advantages to xor over mov in some x86 CPUs (especially Intel's) that have nothing to do with code size, so xor-zeroing is always preferred in x86 assembly.
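
As a rough sketch of the size argument, the encodings from the objdump listing above can be put into C arrays and compared. The names mov_eax_0, xor_eax_eax, and whole_insns_per_block are made up for this illustration, and the 16-byte block model is a simplification that ignores instructions straddling a block boundary; it is not how any particular decoder works.

```c
#include <stddef.h>

/* Machine code bytes copied from the objdump output above. */
static const unsigned char mov_eax_0[]   = { 0xb8, 0x00, 0x00, 0x00, 0x00 }; /* mov $0x0,%eax */
static const unsigned char xor_eax_eax[] = { 0x31, 0xc0 };                   /* xor %eax,%eax */

/* Hypothetical helper: how many whole instructions of a given length
   fit in one 16-byte block (simplified model, not a real decoder). */
static size_t whole_insns_per_block(size_t insn_len)
{
    return 16 / insn_len;
}
```

Here sizeof mov_eax_0 is 5 and sizeof xor_eax_eax is 2, which is the 2.5x ratio. Under this simplified model only 3 of the 5-byte movs fit whole in a 16-byte block, against 8 of the 2-byte xors.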


Another set of experiments:

void fun1 ( unsigned int *a )
{
    *a=0;
}
unsigned int fun2 ( unsigned int *a, unsigned int *b )
{
    return(*a^*b);
}
unsigned int fun3 ( unsigned int a, unsigned int b )
{
    return(a^b);
}


0000000000000000 <fun1>:
   0:   c7 07 00 00 00 00       movl   $0x0,(%rdi)
   6:   c3                      retq   
   7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
   e:   00 00 

0000000000000010 <fun2>:
  10:   8b 06                   mov    (%rsi),%eax
  12:   33 07                   xor    (%rdi),%eax
  14:   c3                      retq   
  15:   66 66 2e 0f 1f 84 00    nopw   %cs:0x0(%rax,%rax,1)
  1c:   00 00 00 00 

0000000000000020 <fun3>:
  20:   89 f0                   mov    %esi,%eax
  22:   31 f8                   xor    %edi,%eax
  24:   c3                      retq   

This heads down the path of showing what xoring variables (xor i,i as in your question) might lead to. Since you didn't specify what processor or what context you were referring to, it is difficult to paint the whole picture. If, for example, you are talking about C code, you have to understand what compilers do to that code, and that depends heavily on the code in the function itself. If at the time of your xor the compiler has the operand in a register, then depending on your compiler settings you might get the xor eax,eax. Or the compiler can choose to change that to a mov reg,0, or change a something=0; to an xor reg,reg.

Some more sequences to ponder:

If the address of the variable is already in a register:

   7:   c7 07 00 00 00 00       movl   $0x0,(%rdi)

   d:   8b 07                   mov    (%rdi),%eax
   f:   31 c0                   xor    %eax,%eax
  11:   89 07                   mov    %eax,(%rdi)

The compiler will choose the mov of zero instead of the xor, which is what you would get if you tried this C code:

void funx ( unsigned int *a )
{
    *a=*a^*a;
}

The compiler replaces it with a move of zero. The xor sequence fetches the same number of bytes, but needs two memory accesses instead of one, burns a register, and executes three instructions instead of one. So the move of zero is noticeably better.
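
The reason the compiler is free to make that substitution can be sketched in C: x ^ x is zero for every value of x, so both functions below store the same result. The function names are made up for this example, and whether a given compiler emits identical code for them depends on the compiler and its optimization settings.

```c
/* Hypothetical names, for illustration only. */

/* As written in the source: load the value, xor it with itself, store. */
void zero_by_xor(unsigned int *a)
{
    *a = *a ^ *a;   /* x ^ x == 0 for any x */
}

/* Plain store of zero: no load is needed. */
void zero_by_assign(unsigned int *a)
{
    *a = 0;
}
```

If the pointer were declared volatile, the compiler would have to perform the reads, so the two would no longer be interchangeable.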

Now if it is byte-sized and in a register:

13: b0 00                   mov    $0x0,%al
15: 30 c0                   xor    %al,%al

There is no difference in code size (but they still execute differently).


Now if you were talking about another processor, let's say ARM:

   0:   e3a00000    mov r0, #0
   4:   e0200000    eor r0, r0, r0
   8:   e3a00000    mov r0, #0
   c:   e5810000    str r0, [r1]
  10:   e5910000    ldr r0, [r1]
  14:   e0200000    eor r0, r0, r0
  18:   e5810000    str r0, [r1]

You don't save anything by using the xor (exclusive or, eor): one instruction is one instruction, both in fetch and in execution. Xoring something in RAM works as on any other processor: if you have the address of the variable in a register and have to copy the data to another register to perform the xor, you still end up with two memory accesses and three instructions. If you have a processor that can do memory-to-memory operations, the move of zero is cheaper because you only have the one memory access and one or two instructions, depending on the processor.

In fact it's worse than that: eor r0, r0, r0 is required to have an input dependency on r0 (limiting out-of-order execution) because of memory-ordering rules. Xor-zeroing always produces zero, but it only helps performance in x86 assembly.
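
To make the dependency point concrete, here is a small sketch that pulls apart the two ARM instruction words from the listing above. It assumes the classic 32-bit ARM data-processing encoding (bit 25 is the immediate flag, bits 24 to 21 hold the opcode); the macro and helper names are invented for this example.

```c
#include <stdint.h>

#define ARM_OP_EOR 0x1u  /* opcode field value for EOR */
#define ARM_OP_MOV 0xdu  /* opcode field value for MOV */

/* Opcode field of a data-processing instruction: bits 24..21. */
static unsigned arm_dp_opcode(uint32_t insn)
{
    return (insn >> 21) & 0xfu;
}

/* Immediate flag: bit 25. Set means the second operand is a constant,
   clear means it comes from a register. */
static unsigned arm_dp_imm_flag(uint32_t insn)
{
    return (insn >> 25) & 0x1u;
}
```

The word 0xe3a00000 decodes as MOV with the immediate flag set, so its only operand is the constant 0 and it reads no register. The word 0xe0200000 decodes as EOR with the flag clear, so its operands are registers (r0 here) and the instruction has to wait for r0.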


So the bottom line is: it depends. If you are talking about registers in assembler on an x86 system, anywhere from the 8088 to the present, the xor is often faster because the instruction is smaller, fetches faster, takes less cache if you have one, leaves more cache for other code, etc. Likewise, non-x86 processors with variable instruction length that require the zero to be encoded in the instruction will also require a longer instruction, longer fetch time, more cache consumed if there is a cache, etc. So the xor is faster (usually; it depends on how it encodes).

It gets much worse if you have condition flags and you want that move/xor to set the zero flag: you may have to burn the right instruction (on some processors the mov does not change the flags). Some processors have a special zero register that is not general purpose; when you use it you get a zero, so you can encode this very common use case without burning more instruction space or burning an extra instruction cycle loading a zero immediate into a register. On the msp430, for example, a move of 0x1234 would cost you a two-word instruction, but a move of 0x0000 or 0x0001 and a few other constants can be encoded in a single instruction word.

All processors will take the double hit to memory if you are talking about a variable in RAM: a read-modify-write is two memory cycles, not counting the instruction fetches, and it gets worse if the read causes a cache line fill (the write would then be very fast). Without the read, the write alone might pass right by the cache and execute very fast, as the processor could keep running while the write goes on in parallel (sometimes you get that performance gain, sometimes not, always if you tune for it).

The x86 and likely older processors are the reason why you see the habit of xoring instead of moving zero. The performance gain is still there today for those specific optimizations: system memory is still extremely slow and any extra memory cycles are costly; likewise, any cache that is thrown away is costly. Halfway decent compilers, even gcc, will detect an xor i,i as being equivalent to i=0 and choose, on a case-by-case basis, the better (on an average system) instruction sequence.

Get a copy of Zen of Assembly Language by Michael Abrash. Good used copies are available at a reasonable price (under $50), and even if you go for the $80 copies it is well worth it. Try to look beyond the particular 8088 "cycle eaters" and understand the general thought process he is trying to teach. Then spend as much time as you can disassembling your code, ideally for many different processors. Apply what you have learned...

夢归不見 2024-12-16 04:31:18

On older CPUs (but those after the Pentium Pro, as per the comments) this used to be the case. However, most modern CPUs have special hot paths for zero assignment (of registers and well-aligned variables) that should yield equivalent performance. Most modern compilers tend to use a mix of the two, depending on the surrounding code (older MSVC compilers would always use XOR in optimized builds; MSVC still uses XOR quite a bit, but will also use MOV reg,0 in certain circumstances).

This is very much a micro-optimization, so to be honest you can just do whatever suits you best, unless you have tight loops that are lagging due to register dependencies. It should be noted, however, that using XOR takes up less space most of the time, which is great for embedded devices or when you are trying to align a branch target.

This assumes that you are mainly referring to x86 and its derivatives. On that note, @Pascal gave me the idea to put in the technical references that form the basis for this. The Intel Optimization manual has two sections dealing with this, namely 2.1.3.1 Dependency Breaking Idioms and 3.5.1.7 Clearing Registers and Dependency Breaking Idioms. These two sections basically advocate using XOR-based instructions for any form of register clearing due to their dependency-breaking nature (which removes latency). But in sections where condition codes need preserving, MOVing 0 into a register is preferred.
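
The condition-code point can be sketched with GCC-style extended inline assembly. This is an assumption-laden illustration: it needs GCC or Clang targeting x86 with the default AT&T syntax, and on anything else it falls back to plain C. The function and macro names are hypothetical. The xor version lists "cc" as clobbered because xor rewrites the flags, while the mov version does not because mov leaves them alone.

```c
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_GNU_ASM 1
#else
#define HAVE_X86_GNU_ASM 0
#endif

/* Zero a register with xor: short encoding, but the flags are modified. */
static unsigned int zero_with_xor(void)
{
    unsigned int r;
#if HAVE_X86_GNU_ASM
    __asm__ ("xorl %0, %0" : "=r"(r) : : "cc");
#else
    r = 0;  /* fallback on non-x86 or non-GNU compilers */
#endif
    return r;
}

/* Zero a register with mov: longer encoding, but the flags are preserved. */
static unsigned int zero_with_mov(void)
{
    unsigned int r;
#if HAVE_X86_GNU_ASM
    __asm__ ("movl $0, %0" : "=r"(r));
#else
    r = 0;  /* fallback on non-x86 or non-GNU compilers */
#endif
    return r;
}
```

Both functions return 0; the difference is only in what the instruction does to the flags, which is why a compiler that needs a comparison result to survive the zeroing would pick the mov form.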

虚拟世界 2024-12-16 04:31:18

It definitely was true on the 8088 (and to a lesser degree the 8086), due to the xor instruction being shorter and the memory bandwidth limitations on filling the prefetch queue.
