Is there any advantage to using xor reg, reg over mov reg, 0?

Posted on 2024-07-26 16:52:31

There are two well-known ways to set an integer register to zero on x86.

Either

mov reg, 0

or

xor reg, reg

One view is that the second variant is better because the value 0 is not stored in the code, which saves several bytes of machine code. That is definitely good: less instruction cache is used, and this can sometimes allow faster code execution. Many compilers produce such code.

However, there is formally an inter-instruction dependency between the xor instruction and whatever earlier instruction modified the same register. Because of that dependency, the xor would have to wait until the earlier instruction completes, which could leave execution units idle and hurt performance.

add reg, 17
;do something else with reg here
xor reg, reg

It's obvious that the result of xor will be exactly the same regardless of the initial register value. But is the processor able to recognize this?

I tried the following test in VC++7:

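#include <windows.h>
#include <tchar.h>
#include <stdio.h>

// 10 * 1000 * 1000 * 1000 overflows a 32-bit int, so use one billion iterations.
const int Count = 1000 * 1000 * 1000;
int _tmain(int argc, _TCHAR* argv[])
{
    int i;
    DWORD start = GetTickCount();
    for( i = 0; i < Count ; i++ ) {
        __asm {
            mov eax, 10
            xor eax, eax
        };
    }
    DWORD diff = GetTickCount() - start;
    printf("xor loop: %lu ms\n", diff);   // time for the xor variant
    start = GetTickCount();
    for( i = 0; i < Count ; i++ ) {
        __asm {
            mov eax, 10
            mov eax, 0
        };
    }
    diff = GetTickCount() - start;
    printf("mov loop: %lu ms\n", diff);   // time for the mov variant
    return 0;
}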

With optimizations off, both loops take exactly the same time. Does this reasonably prove that the processor recognizes that the xor reg, reg instruction has no dependency on the earlier mov eax, 10 instruction? What would be a better test to check this?
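One possible sharper test, sketched here under several assumptions: follow the zeroing instruction with a long chain of dependent multiplies on eax, so that successive iterations can overlap only if the zeroing instruction does not wait for the old value of eax. The sketch uses GCC/Clang extended asm (AT&T syntax) rather than the VC++ `__asm` syntax of the test above, and `time_chain_ms`, `ZeroIdiom`, and `IMUL_CHAIN` are names made up for illustration. `and eax, 0` is included as a control on the assumption that it produces zero but is not treated as dependency-breaking.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Assumption: GCC or Clang targeting x86/x86-64. On any other target the
// loops are compiled out and the function only measures an empty interval.
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define HAVE_X86_GNU_ASM 1
#else
#define HAVE_X86_GNU_ASM 0
#endif

// Eight multiplies, each consuming the previous result: a latency chain on eax.
#define IMUL_CHAIN                                        \
    "imull %%eax, %%eax\n\t" "imull %%eax, %%eax\n\t"     \
    "imull %%eax, %%eax\n\t" "imull %%eax, %%eax\n\t"     \
    "imull %%eax, %%eax\n\t" "imull %%eax, %%eax\n\t"     \
    "imull %%eax, %%eax\n\t" "imull %%eax, %%eax\n\t"

enum class ZeroIdiom { Xor, Mov, And };

// Hypothetical helper: runs `iterations` rounds of "zero eax, then IMUL_CHAIN"
// and returns the elapsed wall-clock time in milliseconds.
double time_chain_ms(ZeroIdiom idiom, std::uint32_t iterations) {
    const auto start = std::chrono::steady_clock::now();
#if HAVE_X86_GNU_ASM
    if (idiom == ZeroIdiom::Xor) {
        for (std::uint32_t i = 0; i < iterations; ++i)
            asm volatile("xorl %%eax, %%eax\n\t" IMUL_CHAIN ::: "eax", "cc");
    } else if (idiom == ZeroIdiom::Mov) {
        // mov only writes eax, so it never depends on the previous value.
        for (std::uint32_t i = 0; i < iterations; ++i)
            asm volatile("movl $0, %%eax\n\t" IMUL_CHAIN ::: "eax", "cc");
    } else {
        // Control case: and really reads the old eax before writing zero.
        for (std::uint32_t i = 0; i < iterations; ++i)
            asm volatile("andl $0, %%eax\n\t" IMUL_CHAIN ::: "eax", "cc");
    }
#else
    (void)idiom;
    (void)iterations;
#endif
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `time_chain_ms` for each idiom with the same large iteration count and comparing the three times is the intended use. If the xor and mov times are close while the and time is clearly larger, that would suggest the processor treats xor eax, eax as independent of the earlier value; timings are noisy, so several runs are advisable before drawing any conclusion.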

Comments (6)

孤单情人 2024-08-02 16:52:31

An actual answer for you:

Intel 64 and IA-32 Architectures Optimization Reference Manual

Section 3.5.1.7 is where you want to look.

In short there are situations where an xor or a mov may be preferred. The issues center around dependency chains and preservation of condition codes.

In processors based on Intel Core microarchitecture, a number of instructions can help clear execution
dependency when software uses these instruction to clear register content to zero.

In contexts where the condition codes must be preserved, move 0 into
the register instead.
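The condition-code point can be illustrated with a small sketch, assuming GCC or Clang extended asm on x86/x86-64 (the function names `less_than_with_mov` and `less_than_with_xor` are hypothetical). Each function compares two values, zeroes eax, and then reads the "less than" condition with setl. Because mov leaves the flags alone and xor rewrites them, only the mov version still sees the result of the comparison. On other targets the functions fall back to plain C++ that merely emulates the expected results.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define HAVE_X86_GNU_ASM 1
#else
#define HAVE_X86_GNU_ASM 0
#endif

// cmp a, b; mov eax, 0; setl al
// mov does not touch EFLAGS, so setl still reflects the comparison.
std::uint32_t less_than_with_mov(std::int32_t a, std::int32_t b) {
#if HAVE_X86_GNU_ASM
    std::uint32_t result;
    asm("cmpl %2, %1\n\t"
        "movl $0, %%eax\n\t"
        "setl %%al\n\t"
        : "=a"(result)
        : "r"(a), "r"(b)
        : "cc");
    return result;
#else
    return a < b ? 1u : 0u;  // emulation of the expected behaviour
#endif
}

// cmp a, b; xor eax, eax; setl al
// xor clears SF and OF, so the comparison result is lost and setl yields 0.
std::uint32_t less_than_with_xor(std::int32_t a, std::int32_t b) {
#if HAVE_X86_GNU_ASM
    std::uint32_t result;
    asm("cmpl %2, %1\n\t"
        "xorl %%eax, %%eax\n\t"
        "setl %%al\n\t"
        : "=a"(result)
        : "r"(a), "r"(b)
        : "cc");
    return result;
#else
    (void)a;
    (void)b;
    return 0u;  // emulation of the expected behaviour
#endif
}
```

This is why compilers that want a zeroed register for setcc typically place the xor before the comparison rather than after it.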

╭⌒浅淡时光〆 2024-08-02 16:52:31

On modern CPUs the XOR pattern is preferred. It is smaller, and faster.

Smaller actually does matter because on many real workloads one of the main factors limiting performance is i-cache misses. This wouldn't be captured in a micro-benchmark comparing the two options, but in the real world it will make code run slightly faster.

And, ignoring the reduced i-cache misses, XOR has been as fast as or faster than MOV on every CPU of the past many years. What could be faster than executing a MOV instruction? Not executing any instruction at all! On recent Intel processors the dispatch/rename logic recognizes the XOR pattern, 'realizes' that the result will be zero, and just points the register at a physical zero-register. It then throws away the instruction because there is no need to execute it.

The net result is that the XOR pattern uses zero execution resources and can, on recent Intel CPUs, 'execute' four instructions per cycle. MOV tops out at three instructions per cycle.

For details see this blog post that I wrote:

https://randomascii.wordpress.com/2012/12/29/the-surprising-subtleties-of-zeroing-a-register/

Most programmers shouldn't be worrying about this, but compiler writers do have to worry, and it's good to understand the code that is being generated, and it's just frickin' cool!

萌面超妹 2024-08-02 16:52:31

x86 has variable-length instructions. MOV EAX, 0 takes 5 bytes of code space (an opcode plus a 32-bit immediate), three more than the 2-byte XOR EAX, EAX.
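For the 32-bit register forms, the size difference can be checked by writing out the encodings. The sketch below lists the standard byte sequences for the two instructions; the constant names and the `bytes_saved_per_site` helper are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// mov eax, 0   -> opcode B8 followed by a 32-bit immediate (5 bytes)
constexpr std::array<std::uint8_t, 5> kMovEaxZero = {0xB8, 0x00, 0x00, 0x00, 0x00};

// xor eax, eax -> opcode 31 followed by a ModRM byte, C0 = eax, eax (2 bytes)
constexpr std::array<std::uint8_t, 2> kXorEaxEax = {0x31, 0xC0};

// Hypothetical helper: bytes saved at each zeroing site by choosing xor.
constexpr std::size_t bytes_saved_per_site() {
    return kMovEaxZero.size() - kXorEaxEax.size();
}

static_assert(bytes_saved_per_site() == 3, "xor form is three bytes shorter");
```

The saving is per zeroing site, so it only adds up in code that clears registers often.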

北方的韩爷 2024-08-02 16:52:31

I stopped being able to fix my own cars after I sold my 1966 HR station wagon. I'm in a similar fix with modern CPUs :-)

It really will depend on the underlying microcode or circuitry. It's quite possible that the CPU could recognise "XOR Rn,Rn" and simply zero all bits without worrying about the contents. But of course, it may do the same thing with a "MOV Rn, 0". A good compiler will choose the best variant for the target platform anyway so this is usually only an issue if you're coding in assembler.

If the CPU is smart enough, your XOR dependency disappears since it knows the value is irrelevant and will set it to zero anyway (again this depends on the actual CPU being used).

However, I'm long past caring about a few bytes or a few clock cycles in my code - this seems like micro-optimisation gone mad.

明媚如初 2024-08-02 16:52:31

I think on earlier architectures the mov eax, 0 instruction also used to take a little longer than xor eax, eax... I cannot recall exactly why. Unless you have many more movs, however, I would imagine you're not likely to cause cache misses because of that one literal stored in the code.

Also note that from memory the status of the flags is not identical between these methods, but I may be misremembering this.

花辞树 2024-08-02 16:52:31

Are you writing a compiler?

On a second note, your benchmark probably won't work, since you have a branch in there that probably takes all the time anyway (unless your compiler unrolls the loop for you).

Another reason that you can't benchmark a single instruction in a loop is that all your code will be cached (unlike real code). So you have taken much of the size difference between mov eax,0 and xor eax,eax out of the picture by keeping the code in the L1 cache the whole time.

My guess is that any measurable performance difference in the real world would be due to the size difference eating up the cache, and not due to execution time of the two options.
