Is there a performance hit when mixing SSE integer/float SIMD instructions?

Posted on 2024-10-17 03:52:08

I've been using x86 SIMD instructions (SSE through SSE4) in the form of intrinsics quite a lot lately. What I find frustrating is that the SSE ISA has several simple instructions that are available only for floats or only for integers, but that in theory should perform equally well for both. For example, both float and double vectors have instructions to load the upper 64 bits of a 128-bit vector from an address (movhps, movhpd), but there is no such instruction for integer vectors.

My question:

Is there any reason to expect a performance hit when using floating-point instructions on integer vectors, e.g. using movhps to load data into an integer vector?
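To make the case concrete, here is a minimal sketch of the two ways this load might be written with intrinsics. The helper names (load_high64_movhps, load_high64_int, high64_variants_agree) are hypothetical and chosen only for illustration. The first helper goes through the cast intrinsics and _mm_loadh_pi, which typically compiles to movhps. The second stays with integer intrinsics (a 64-bit load followed by an unpack). Both should yield the same bits; whether they differ in speed depends on the processor, and a compiler may pick different instructions than the intrinsic names suggest.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also pulls in the SSE1 header for _mm_loadh_pi */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: replace the upper 64 bits of an integer vector with
 * the 8 bytes at p, using the float-domain load (typically movhps). */
static __m128i load_high64_movhps(__m128i v, const void *p)
{
    assert(p != NULL);
    __m128 f = _mm_castsi128_ps(v);         /* reinterpretation only */
    f = _mm_loadh_pi(f, (const __m64 *)p);  /* load 64 bits into the upper half */
    return _mm_castps_si128(f);             /* back to an integer vector type */
}

/* Hypothetical integer-only alternative: 64-bit load, then interleave the
 * low 64 bits of v with the loaded 64 bits. */
static __m128i load_high64_int(__m128i v, const void *p)
{
    assert(p != NULL);
    __m128i hi = _mm_loadl_epi64((const __m128i *)p);  /* low 64 bits = *p, upper zeroed */
    return _mm_unpacklo_epi64(v, hi);                   /* result = [v.low, hi.low] */
}

/* Small self-check: both variants should produce identical bits. */
static int high64_variants_agree(void)
{
    const uint64_t src = 0x1122334455667788ULL;
    const __m128i v = _mm_set_epi32(4, 3, 2, 1);
    uint64_t a[2], b[2];
    _mm_storeu_si128((__m128i *)a, load_high64_movhps(v, &src));
    _mm_storeu_si128((__m128i *)b, load_high64_int(v, &src));
    return a[0] == 0x0000000200000001ULL  /* low half left untouched */
        && a[1] == src                    /* high half replaced by *p */
        && a[0] == b[0] && a[1] == b[1];
}
```

The integer-only variant costs an extra instruction, so it is not automatically the better choice; it is shown only as the option to compare against when measuring.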

I wrote several tests to check this, but I doubt their results are reliable. It's really hard to write a correct test that explores all the corner cases for something like this, especially since instruction scheduling is most likely involved.

Related question:

Other trivially similar operations also have several instructions that do basically the same thing. For example, I can do a bitwise OR with por, orps, or orpd. Can anyone explain the purpose of these additional instructions? My guess is that it is related to different scheduling being applied to each instruction.
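As an illustration of how interchangeable these are at the bit level, the sketch below computes the same OR through the three intrinsics that nominally map to por, orps, and orpd. The function names are hypothetical. Note that this is an assumption about typical compiler output: a compiler is free to emit a different one of the three instructions than the intrinsic suggests, so the generated assembly should be inspected before drawing any performance conclusions.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; also pulls in the SSE1 header for _mm_or_ps */

/* Integer domain: nominally por. */
static __m128i or_por(__m128i a, __m128i b)
{
    return _mm_or_si128(a, b);
}

/* Single-precision domain: nominally orps. The casts are reinterpretations. */
static __m128i or_orps(__m128i a, __m128i b)
{
    return _mm_castps_si128(_mm_or_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
}

/* Double-precision domain: nominally orpd. */
static __m128i or_orpd(__m128i a, __m128i b)
{
    return _mm_castpd_si128(_mm_or_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b)));
}

/* Returns 1 if all 16 bytes of x and y are equal. */
static int same_bits(__m128i x, __m128i y)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) == 0xFFFF;
}

/* Self-check: the three variants should agree bit for bit, including on
 * inputs that would be NaN or denormal if interpreted as floats. */
static int or_variants_agree(void)
{
    const __m128i a = _mm_set_epi32(0x0F0F0F0F, 0x12345678, 0x7FC00000, 0);
    const __m128i b = _mm_set_epi32(0x70F0F0F0, 0x0000FFFF, 0x00000001, -1);
    const __m128i r = or_por(a, b);
    return same_bits(r, or_orps(a, b)) && same_bits(r, or_orpd(a, b));
}
```

Since the results are identical, any difference between the three forms would have to come from encoding size or from which execution domain the processor assigns the instruction to, not from the values produced.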

Comments (1)

不必了 2024-10-24 03:52:08

From an expert (obviously not me :P): http://www.agner.org/optimize/optimizing_assembly.pdf [13.2 Using vector instructions with other types of data than they are intended for (pages 118-119)]:

There is a penalty for using the wrong type of instructions on some processors. This is
because the processor may have different data buses or different execution units for integer
and floating point data. Moving data between the integer and floating point units can take
one or more clock cycles depending on the processor, as listed in table 13.2.

Processor                       Bypass delay, clock cycles 
  Intel Core 2 and earlier        1 
  Intel Nehalem                   2 
  Intel Sandy Bridge and later    0-1 
  Intel Atom                      0 
  AMD                             2 
  VIA Nano                        2-3 
Table 13.2. Data bypass delays between integer and floating point execution units 