Is there an advantage to using xor reg, reg over mov reg, 0?

Posted on 2024-07-26 16:52:31

There are two well-known ways to set an integer register to zero on x86.

Either

mov reg, 0

or

xor reg, reg

One view holds that the second variant is better: the value 0 is not stored in the code, which saves several bytes of machine code. That is definitely good, since less instruction cache is used, and this can sometimes make code run faster. Many compilers produce such code.
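To put a number on "several bytes", the two 32-bit forms can be compared by their standard encodings for EAX. The sketch below only stores those bytes in arrays and compares lengths; the names (`kMovEaxZero`, `kXorEaxEax`, `bytes_saved`, `check_encodings`) are made up for illustration and are not from the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// mov eax, 0   -> opcode B8 followed by a 4-byte immediate
constexpr std::array<std::uint8_t, 5> kMovEaxZero = {{0xB8, 0x00, 0x00, 0x00, 0x00}};
// xor eax, eax -> opcode 31 with ModRM byte C0, no immediate
constexpr std::array<std::uint8_t, 2> kXorEaxEax = {{0x31, 0xC0}};

// Code bytes saved per zeroing by choosing XOR over MOV.
constexpr std::size_t bytes_saved() {
    return kMovEaxZero.size() - kXorEaxEax.size();
}

static_assert(bytes_saved() == 3, "xor eax, eax is 3 bytes shorter than mov eax, 0");

// Runtime sanity check on the opcode bytes listed above.
inline void check_encodings() {
    assert(kMovEaxZero[0] == 0xB8);
    assert(kXorEaxEax[0] == 0x31 && kXorEaxEax[1] == 0xC0);
}
```

Three bytes per zeroing is small on its own, but the saving applies at every place a register is cleared.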

However, there is formally an inter-instruction dependency between the xor instruction and whatever earlier instruction changed the same register. Because of that dependency, the later instruction has to wait until the earlier one completes, which could leave processor execution units underused and hurt performance.

add reg, 17
;do something else with reg here
xor reg, reg

Obviously the result of the xor is exactly the same regardless of the register's initial value. But is the processor able to recognize this?

I tried the following test in VC++7:

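#include <windows.h>
#include <tchar.h>
#include <stdio.h>

// 10 * 1000 * 1000 * 1000 overflows a 32-bit int, so the count is reduced to 10^9.
const int Count = 1000 * 1000 * 1000;
int _tmain(int argc, _TCHAR* argv[])
{
    int i;
    // Loop 1: write eax, then zero it with xor.
    DWORD start = GetTickCount();
    for( i = 0; i < Count ; i++ ) {
        __asm {
            mov eax, 10
            xor eax, eax
        };
    }
    DWORD diff = GetTickCount() - start;
    printf("xor: %lu ms\n", diff);
    // Loop 2: write eax, then zero it with mov.
    start = GetTickCount();
    for( i = 0; i < Count ; i++ ) {
        __asm {
            mov eax, 10
            mov eax, 0
        };
    }
    diff = GetTickCount() - start;
    printf("mov: %lu ms\n", diff);
    return 0;
}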

With optimizations off, both loops take exactly the same time. Does this reasonably prove that the processor recognizes that the xor reg, reg instruction has no dependency on the earlier mov eax, 10 instruction? What would be a better test to check this?
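One possible shape for a more telling test, offered as a sketch rather than a definitive benchmark: put a slow dependency chain on the register before zeroing it, and add a control that zeroes the register in a way that really does read the old value. The code below assumes GCC or Clang inline-assembly syntax on x86 (not the VC++ `__asm` syntax used in the question); all names (`chain_then_zero`, `Zeroing`, `seconds_for`) are hypothetical. It also assumes that `and reg, 0` is not treated as a dependency-breaking idiom, which should be checked against the documentation for the CPU under test.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Inline asm is only used with GCC/Clang on x86 (an assumption of this sketch);
// elsewhere the loop body degrades to a plain C++ store of zero.
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#define ZERO_TEST_HAS_ASM 1
#else
#define ZERO_TEST_HAS_ASM 0
#endif

// Four dependent multiplies: a slow chain through a single register.
#define IMUL_CHAIN "imull %0, %0\n\timull %0, %0\n\timull %0, %0\n\timull %0, %0\n\t"

enum class Zeroing { Xor, Mov, AndZero };

// Each iteration runs the multiply chain on one register and then zeroes that
// same register. If the zeroing instruction had to wait for the old value,
// iteration N+1 could not start before the chain of iteration N finished.
template <Zeroing kind>
std::uint32_t chain_then_zero(std::uint64_t iters) {
    std::uint32_t r = 3;
    for (std::uint64_t i = 0; i < iters; ++i) {
#if ZERO_TEST_HAS_ASM
        if (kind == Zeroing::Xor)
            __asm__ __volatile__(IMUL_CHAIN "xorl %0, %0" : "+r"(r) : : "cc");
        else if (kind == Zeroing::Mov)
            __asm__ __volatile__(IMUL_CHAIN "movl $0, %0" : "+r"(r) : : "cc");
        else  // control: assumed to keep a true dependency on the old value
            __asm__ __volatile__(IMUL_CHAIN "andl $0, %0" : "+r"(r) : : "cc");
#else
        r = 0;
#endif
    }
    return r;
}

// Wall-clock seconds for one call of fn(iters); fn is one of the variants above.
template <typename Fn>
double seconds_for(Fn fn, std::uint64_t iters) {
    const auto t0 = std::chrono::steady_clock::now();
    const std::uint32_t result = fn(iters);
    const auto t1 = std::chrono::steady_clock::now();
    assert(iters == 0 || result == 0);  // every variant leaves the register zeroed
    (void)result;
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling, for example, `seconds_for(chain_then_zero<Zeroing::Xor>, 100000000)` and the same for `Mov` and `AndZero` gives three timings. If the `Xor` and `Mov` times are close and both clearly lower than `AndZero`, that suggests the processor does not make the xor wait for the earlier value. Build with optimizations on so that loop overhead does not dominate.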

孤单情人 2024-08-02 16:52:31

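An actual answer for you:

Intel 64 and IA-32 Architectures Optimization Reference Manual

Section 3.5.1.7 is where you want to look.

In short, there are situations where either an xor or a mov may be preferred. The issues center on dependency chains and the preservation of condition codes.

In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero.

In contexts where the condition codes must be preserved, move 0 into the register instead.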
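The condition-code point can be illustrated with a tiny software model, given here as a simplified sketch and not as a description of real hardware. It tracks only a few status flags, and the names (`Flags`, `Cpu`, `xor_eax_eax`, `mov_eax_0`, `below_after_zeroing`) are invented for the example. The one fact it relies on is that xor rewrites the flags from its result while mov leaves them alone.

```cpp
#include <cassert>
#include <cstdint>

// Simplified view of the status flags relevant here (AF is omitted).
struct Flags {
    bool cf, zf, sf, of, pf;
};

struct Cpu {
    std::uint32_t eax;
    Flags flags;
};

// xor eax, eax: the result is 0 and the flags are rewritten from that result.
inline void xor_eax_eax(Cpu& cpu) {
    cpu.eax = 0;
    cpu.flags = Flags{false, true, false, false, true};  // CF=0 ZF=1 SF=0 OF=0 PF=1
}

// mov eax, 0: writes the register and leaves every flag untouched.
inline void mov_eax_0(Cpu& cpu) {
    cpu.eax = 0;
}

// Models "cmp a, b", then a zeroing instruction, then a read of CF
// (as a setb or jb would do). Returns the CF value that the reader sees.
inline bool below_after_zeroing(std::uint32_t a, std::uint32_t b, bool use_xor) {
    Cpu cpu{0, Flags{}};
    cpu.flags.cf = a < b;   // the parts of cmp's effect this example needs
    cpu.flags.zf = a == b;
    if (use_xor) {
        xor_eax_eax(cpu);
    } else {
        mov_eax_0(cpu);
    }
    return cpu.flags.cf;
}
```

With `a = 1` and `b = 2`, the mov path still reports "below" after the zeroing, while the xor path has already cleared CF. That is the situation in which a mov of 0 is the safer choice.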


╭⌒浅淡时光〆 2024-08-02 16:52:31

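On modern CPUs the XOR pattern is preferred. It is smaller and faster.

Smaller actually does matter, because on many real workloads one of the main factors limiting performance is i-cache misses. A micro-benchmark comparing the two options would not capture this, but in the real world it makes code run slightly faster.

And, ignoring the reduced i-cache misses, XOR has been the same speed as MOV or faster on any CPU of the last many years. What could be faster than executing a MOV instruction? Not executing any instruction at all! On recent Intel processors the dispatch/rename logic recognizes the XOR pattern, "realizes" that the result will be zero, and simply points the register at a physical zero register. It then throws the instruction away because there is no need to execute it.

The net result is that the XOR pattern uses zero execution resources and can, on recent Intel CPUs, "execute" four instructions per cycle. MOV tops out at three instructions per cycle.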
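The rename-stage behaviour described above can be pictured with a toy model. This is a loose sketch that follows only the description in this answer; real rename hardware is far more involved, and every name here (`Instr`, `Renamer`, `kZeroPhys`) is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical physical register that always reads as zero.
constexpr int kZeroPhys = 0;

struct Instr {
    enum Op { Mov, Xor, Add } op;
    int dst;  // architectural register index (0..7)
    int src;  // architectural register index, or -1 for an immediate operand
};

class Renamer {
public:
    // Renames one instruction. Returns true if the instruction still has to
    // be sent to an execution unit, false if the renamer handled it alone.
    bool rename(const Instr& in) {
        if (in.op == Instr::Xor && in.dst == in.src) {
            // Zeroing idiom: repoint the register and drop the instruction.
            map_[static_cast<std::size_t>(in.dst)] = kZeroPhys;
            return false;
        }
        // Anything else gets a fresh physical register and must execute.
        map_[static_cast<std::size_t>(in.dst)] = next_phys_++;
        return true;
    }

    // Physical register currently backing the given architectural register.
    int phys(int arch) const { return map_[static_cast<std::size_t>(arch)]; }

private:
    std::array<int, 8> map_{{1, 2, 3, 4, 5, 6, 7, 8}};
    int next_phys_ = 9;
};
```

In this model `xor r0, r0` never reaches an execution unit and has no input to wait on, whereas `mov r1, 0` is still counted as an instruction to execute, which mirrors the throughput difference the answer describes.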

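For details, see this blog post that I wrote:

https://randomascii.wordpress.com/2012/12/29/the-surprising-subtleties-of-zeroing-a-register/

Most programmers shouldn't be worrying about this, but compiler writers do have to worry, it's good to understand the code that is being generated, and it's just plain cool!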
