Does RSQRTSS break the dependency on the destination register?

Posted 2025-02-04 05:11:11 · 2301 characters · 2 views · 0 comments

Using uiCA I produced a trace table for the following code.

cvtsi2ss xmm0, eax
addss xmm0, xmm0

https://uica.uops.info/tmp/780bce9e56ee4a718d5369deb1326215_trace.html

You can see that each cvtsi2ss has to wait for the previous iteration to finish because it depends on some bits (32:127) of xmm0.

However, changing cvtsi2ss to rsqrtss makes a big difference.

rsqrtss xmm0, xmm1
addss xmm0, xmm0

https://uica.uops.info/tmp/8897a7d45c8348e68279aea4d0b18e15_trace.html

Each rsqrtss executes in parallel with the previous iteration. I don't understand this: rsqrtss leaves bits 32:127 of its output unchanged, just like cvtsi2ss, so I would expect it to wait for any pending operation on the output register to finish, just as cvtsi2ss does.
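The merge behaviour described above can be sketched in portable C. This is only an illustrative model, not how any CPU implements it: `xmm_model` and `model_cvtsi2ss` are hypothetical names, and the register is treated as four float lanes. The point is that the old destination value is an input to the result, which is why a scalar write of this kind cannot start until the previous writer of the register has finished. rsqrtss merges into its destination in the same way.

```c
#include <assert.h>

/* Hypothetical model of a 128-bit XMM register as four 32-bit float lanes.
   lane[0] is bits 0:31, lane[1..3] are bits 32:127. */
typedef struct { float lane[4]; } xmm_model;

/* Models `cvtsi2ss dst, eax`: only lane 0 is computed from the integer
   source; lanes 1..3 are carried over from the OLD value of dst.
   Because old_dst is a parameter, it is a true input of the operation. */
static xmm_model model_cvtsi2ss(xmm_model old_dst, int eax) {
    xmm_model result = old_dst;      /* merge: start from the old destination */
    result.lane[0] = (float)eax;     /* overwrite only the low 32 bits */
    return result;
}

/* Returns 1 if lanes 1..3 (bits 32:127) are identical in both values. */
static int upper_lanes_equal(xmm_model a, xmm_model b) {
    assert(sizeof a.lane / sizeof a.lane[0] == 4);
    return a.lane[1] == b.lane[1] &&
           a.lane[2] == b.lane[2] &&
           a.lane[3] == b.lane[3];
}
```

Calling `model_cvtsi2ss` on a register holding {1, 2, 3, 4} with `eax = 7` yields {7, 2, 3, 4}: the upper three lanes come from the old destination.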


After reading the answer, I ran a simple test, and it seems clear that uiCA has a bug.

IACA also fails to catch the output dependency.

Please correct me if the test code is wrong.

__asm__ (
    R"(.section .text
    .balign 16
noXor:
    mov eax, 0x3f800000
    movd xmm1, eax
    rdtscp
    shl rdx, 32
    or rax, rdx
    mov rdi, rax
    mov ecx, 1 << 30
    jmp noXor_loop
    .balign 16
noXor_loop:
    rsqrtss xmm0, xmm1
    addss xmm0, xmm0
    dec ecx
    jnz noXor_loop
    rdtscp
    shl rdx, 32
    or rax, rdx
    sub rax, rdi
    ret
    .balign 16
yesXor:
    mov eax, 0x3f800000
    movd xmm1, eax
    rdtscp
    shl rdx, 32
    or rax, rdx
    mov rdi, rax
    mov ecx, 1 << 30
    jmp yesXor_loop
    .balign 16
yesXor_loop:
    xorps xmm0, xmm0
    rsqrtss xmm0, xmm1
    addss xmm0, xmm0
    dec ecx
    jnz yesXor_loop
    rdtscp
    shl rdx, 32
    or rax, rdx
    sub rax, rdi
    ret)"
);

unsigned long long noXor(void);
unsigned long long yesXor(void);

#include <stdio.h>

int main() {
    for (int i = 0; i < 4; ++i) {
        printf("noXor: %llu yesXor: %llu\n", noXor(), yesXor());
    }
    return 0;
}

Output:

noXor: 4978836501 yesXor: 696810039
noXor: 4971780086 yesXor: 690780109
noXor: 4977293771 yesXor: 687404710
noXor: 5499602729 yesXor: 687954399
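To read these numbers, it can help to convert them to ticks per loop iteration. The following is a small sketch with hypothetical helper names (`ticks_per_iteration`, `slowdown`); the iteration count comes from `mov ecx, 1 << 30` in the test. Note that rdtscp counts TSC reference ticks, which are not necessarily core clock cycles, so the per-iteration figures are only a rough guide; the ratio between the two loops is the more meaningful quantity.

```c
#include <assert.h>
#include <stdint.h>

/* Loop count used by both test loops (`mov ecx, 1 << 30`). */
#define ITERATIONS (1ULL << 30)

/* TSC ticks per loop iteration for one measured run.
   These are reference ticks, not necessarily core clock cycles. */
static double ticks_per_iteration(uint64_t tsc_delta) {
    assert(tsc_delta > 0);
    return (double)tsc_delta / (double)ITERATIONS;
}

/* How many times slower the loop without xorps ran. */
static double slowdown(uint64_t no_xor_ticks, uint64_t yes_xor_ticks) {
    assert(yes_xor_ticks > 0);
    return (double)no_xor_ticks / (double)yes_xor_ticks;
}
```

For the first row of the output, this gives roughly 4.64 ticks per iteration without xorps and roughly 0.65 with it, a slowdown of about 7.1x, which is consistent with a loop-carried dependency through xmm0 in the noXor loop.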


Comments (1)

耀眼的星火 2025-02-11 05:11:11

Test on real hardware and you'll see the expected result: rsqrtss xmm0, xmm1 has 4-cycle latency as part of the xmm0 -> xmm0 dependency chain (on my Skylake).

That's a bug in uiCA. Or, more precisely, in the https://uops.info/ data it uses: your trace includes a link to the rsqrtss page on uops.info, where we can see they only measured latency for the operand 2 → 1 case, with no entry for 1 → 1. When that test was written, the author may have copy/pasted the rsqrtps test and forgotten to add a test for an output dependency.

Without testing on a real CPU yourself, you should only be surprised if an actual test had measured zero latency for 1 → 1 of rsqrtss. Without such a test, the correct assumption is that the test is missing (and thus the uiCA result is wrong), not that the latency is actually zero.

Many instructions don't have output dependencies on any CPU, so it makes sense that https://uops.info/ doesn't test them all. We'd rather have rsqrtps latency listed as 4 than [0:4], unless there are some CPUs where it's non-zero.

It makes sense that uiCA assumes no output dependency when there's no test data; that's normal for instructions without output dependencies.
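The effect of treating a missing entry as "no dependency" can be sketched as follows. This is a hypothetical model, not uiCA's actual code or the uops.info data format: `latency_entry`, `LATENCY_UNKNOWN`, and `loop_carried_cycles` are made-up names. It computes the length of the loop-carried chain through xmm0 for the loop `rsqrtss xmm0, xmm1 ; addss xmm0, xmm0`.

```c
#include <assert.h>

/* Hypothetical marker meaning "no measurement exists for this operand pair". */
#define LATENCY_UNKNOWN (-1)

/* Hypothetical latency record for a two-operand instruction. */
typedef struct {
    int src_to_dst;   /* operand 2 -> operand 1 latency, in cycles */
    int dst_to_dst;   /* operand 1 -> operand 1 latency, or LATENCY_UNKNOWN */
} latency_entry;

/* Cycles per iteration imposed by the dependency chain through xmm0 for
       rsqrtss xmm0, xmm1
       addss   xmm0, xmm0
   A result of 0 means the model sees no loop-carried chain at all, so
   consecutive iterations appear free to overlap. */
static int loop_carried_cycles(latency_entry rsqrtss, int addss_latency) {
    assert(addss_latency > 0);
    if (rsqrtss.dst_to_dst == LATENCY_UNKNOWN) {
        /* Missing entry treated as "no output dependency": rsqrtss starts
           a fresh chain every iteration, so nothing is carried over. */
        return 0;
    }
    /* addss waits for rsqrtss; the next rsqrtss waits for this addss. */
    return rsqrtss.dst_to_dst + addss_latency;
}
```

With the 4-cycle rsqrtss latency measured above and an assumed 4-cycle addss latency, an entry of `{4, 4}` gives an 8-cycle chain per iteration, while `{4, LATENCY_UNKNOWN}` gives 0, matching the overlapping iterations seen in the trace.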

But of course the rsqrtss test should measure latency from each operand to the destination separately. It's non-zero, but it's at least possible in theory for a CPU to allow late forwarding for the merge target, so it's wise to actually test instead of assuming it's the same as for the proper source. (I don't know of any x86 CPUs where a single uop allows late forwarding, so different latencies from different inputs usually only happen with multi-uop instructions. This is unlike some ARM CPUs, where the FMA and/or integer MAC units allow late forwarding for the addend.)


BTW, your reasoning is correct: rsqrtss does have a dependency unless the hardware does partial-register renaming for XMM regs. But no real-world hardware does that.

PIII and Pentium-M had to write each 64-bit half of an XMM register separately, and maybe could write one half without the other, but rsqrtss leaves half of that low half unmodified, thanks to Intel's short-sighted design choices. (Now I'm curious whether Pentium-M cvtsi2sd xmm0, eax or sqrtsd xmm0, xmm1 has a false output dependency.) But current CPUs write a whole XMM register at once.

The AVX version vrsqrtss even takes an extra source operand to merge with, which can be separate from the destination the result is written to.
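The difference between the two encodings can be modelled by recording which registers an instruction reads. This is an illustrative sketch only: `scalar_op`, `sse_rsqrtss`, `avx_vrsqrtss`, and `reads_own_destination` are hypothetical names, and registers are represented by their XMM number.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of a scalar instruction by XMM register number:
   the register written, the register the upper bits are merged from, and
   the register the low element is computed from. */
typedef struct {
    int dst;
    int merge_src;
    int src;
} scalar_op;

/* Legacy-SSE form `rsqrtss dst, src`: the merge source is always dst. */
static scalar_op sse_rsqrtss(int dst, int src) {
    scalar_op op = { dst, dst, src };
    return op;
}

/* VEX form `vrsqrtss dst, merge_src, src`: the merge source is explicit. */
static scalar_op avx_vrsqrtss(int dst, int merge_src, int src) {
    scalar_op op = { dst, merge_src, src };
    return op;
}

/* True if the instruction reads the register it writes, i.e. it has to
   wait for the previous writer of its own destination. */
static bool reads_own_destination(scalar_op op) {
    assert(op.dst >= 0 && op.merge_src >= 0 && op.src >= 0);
    return op.merge_src == op.dst || op.src == op.dst;
}
```

In this model `rsqrtss xmm0, xmm1` always reads xmm0, whereas `vrsqrtss xmm0, xmm1, xmm1` does not, so the VEX form lets the programmer pick a merge source that is not part of the loop-carried chain.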
