Does RSQRTSS break the dependency on the destination register?
Using uiCA, I made a trace table for the following code.
cvtsi2ss xmm0, eax
addss xmm0, xmm0
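You can see that each cvtsi2ss has to wait for the previous iteration to complete, because it depends on some bits of xmm0 (bits 32:127).
However, changing cvtsi2ss to rsqrtss makes a big difference.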
rsqrtss xmm0, xmm1
addss xmm0, xmm0
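Each rsqrtss executes in parallel with the previous iteration. I don't understand this: rsqrtss leaves bits 32:127 of the output unchanged just like cvtsi2ss does, so it should have to wait for the previous operation on the output register to complete, just like cvtsi2ss.
After reading the answer, I ran a simple test, and it seems that uiCA has a bug. IACA also fails to catch the output dependency. Please correct me if the test code is wrong.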
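/* The asm below is Intel syntax with no .intel_syntax directive, so this
   assumes a build such as gcc -masm=intel (the original flags are not given). */
__asm__ (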
R"(.section .text
.balign 16
noXor:
mov eax, 0x3f800000
movd xmm1, eax
rdtscp
shl rdx, 32
or rax, rdx
mov rdi, rax
mov ecx, 1 << 30
jmp noXor_loop
.balign 16
noXor_loop:
rsqrtss xmm0, xmm1
addss xmm0, xmm0
dec ecx
jnz noXor_loop
rdtscp
shl rdx, 32
or rax, rdx
sub rax, rdi
ret
.balign 16
yesXor:
mov eax, 0x3f800000
movd xmm1, eax
rdtscp
shl rdx, 32
or rax, rdx
mov rdi, rax
mov ecx, 1 << 30
jmp yesXor_loop
.balign 16
yesXor_loop:
xorps xmm0, xmm0
rsqrtss xmm0, xmm1
addss xmm0, xmm0
dec ecx
jnz yesXor_loop
rdtscp
shl rdx, 32
or rax, rdx
sub rax, rdi
ret)"
);
unsigned long long noXor(void);
unsigned long long yesXor(void);
#include <stdio.h>
int main() {
for (int i = 0; i < 4; ++i) {
printf("noXor: %llu yesXor: %llu\n", noXor(), yesXor());
}
return 0;
}
noXor: 4978836501 yesXor: 696810039
noXor: 4971780086 yesXor: 690780109
noXor: 4977293771 yesXor: 687404710
noXor: 5499602729 yesXor: 687954399
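To make the totals above easier to compare, they can be divided by the loop trip count (1 << 30 in both functions). The following is only a minimal sketch: the helper names ticks_per_iteration and slowdown are hypothetical and not part of the original test. Note that rdtscp counts TSC ticks, which on current CPUs run at a fixed reference frequency rather than the core clock, so these figures are not core cycles.

```c
#include <assert.h>
#include <stdint.h>

/* Trip count of both benchmark loops above (mov ecx, 1 << 30). */
#define LOOP_ITERATIONS (1ULL << 30)

/* Hypothetical helper: average TSC ticks spent per loop iteration. */
static double ticks_per_iteration(uint64_t tsc_delta, uint64_t iterations) {
    assert(iterations != 0);
    return (double)tsc_delta / (double)iterations;
}

/* Hypothetical helper: how many times longer the slow run took than the fast one. */
static double slowdown(uint64_t slow_delta, uint64_t fast_delta) {
    assert(fast_delta != 0);
    return (double)slow_delta / (double)fast_delta;
}
```

For the first row of output, ticks_per_iteration(4978836501, LOOP_ITERATIONS) comes to roughly 4.6 ticks for noXor and ticks_per_iteration(696810039, LOOP_ITERATIONS) to roughly 0.65 ticks for yesXor, a slowdown of about 7x without the xorps.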
Comments (1)
Test on real hardware and you'll see the expected result:
rsqrtss xmm0, xmm1 has 4-cycle latency as part of the xmm0 -> xmm0 dependency chain (on my Skylake).
That's a bug in uiCA. Or actually in the https://uops.info/ data it uses: your trace includes a link to the rsqrtss page on uops.info, where we can see they only measured latency for the operand 2 → 1 case, with no entry for 1 → 1. When that test was written, the author maybe copy/pasted the rsqrtps test and forgot to add a test for an output dependency.
Without testing yourself on a real CPU, you should only be surprised if there was an actual test that measured zero latency for 1 → 1 of rsqrtss. Without such a test, the correct assumption is that the test is missing (and thus the uiCA result is wrong), not that the latency is actually zero.
Many instructions don't have output dependencies on any CPUs, so it makes sense that https://uops.info/ doesn't test them all. We'd rather have rsqrtps latency listed as 4 than [0:4], unless there are some CPUs where it's non-zero.
It makes sense that uiCA will assume no output dependency when there's no test data; that's normal for instructions without output dependencies.
But of course rsqrtss should have its latency tested from each operand separately to the destination. It's non-zero, but it's at least possible in theory for a CPU to allow late forwarding for the merge target, so it's wise to actually test instead of assuming it's the same as for the proper source. (I don't know of any x86 CPUs where a single uop allows late forwarding, so different latencies from different inputs usually only happen with multi-uop instructions. Unlike some ARM CPUs, where the FMA and/or integer MAC units allow late forwarding for the addend.)
BTW, your reasoning is correct: rsqrtss does have a dependency unless the hardware does partial-register renaming for XMM regs. But no real-world hardware does that.
PIII and Pentium-M had to write each 64-bit half of an XMM separately, and maybe could write one half without the other, but rsqrtss leaves half of that low half unmodified, thanks to Intel's short-sighted design choices. (Now I'm curious whether Pentium-M cvtsi2sd xmm0, eax or sqrtsd xmm0, xmm1 has a false output dependency.) But current CPUs write a whole XMM register at once.
The AVX version vrsqrtss even takes an extra source operand to merge with, which can be separate from the destination the result is written to.
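The merging behaviour described above can be pictured with a toy model in C. This is only an illustrative sketch, not how any CPU implements it: the type xmm_model and the functions sse_scalar_write and vex_scalar_write are hypothetical names, and the model takes the low-lane result as a parameter instead of computing a reciprocal square root.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a 128-bit XMM register as four 32-bit float lanes. */
typedef struct { float lane[4]; } xmm_model;

/* Legacy-SSE scalar write (the shape of rsqrtss dst, src):
   only lane 0 is replaced; lanes 1..3 keep the old dst contents,
   so the old value of dst is an input to the operation. */
static void sse_scalar_write(xmm_model *dst, float low_result) {
    assert(dst != NULL);
    dst->lane[0] = low_result;
}

/* VEX scalar write (the shape of vrsqrtss dst, merge, src):
   lanes 1..3 come from a separate merge operand, so the old
   value of dst is never read unless dst and merge are the same. */
static void vex_scalar_write(xmm_model *dst, const xmm_model *merge,
                             float low_result) {
    assert(dst != NULL && merge != NULL);
    xmm_model out = *merge;   /* copy first so dst may alias merge */
    out.lane[0] = low_result;
    *dst = out;
}
```

In the first function the result depends on whatever last wrote dst, which is the output dependency the noXor loop exposes. In the second, choosing a merge operand that is not part of a loop-carried chain avoids waiting on the previous write to dst.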