Is there a performance penalty for mixing SSE integer and floating-point SIMD instructions?

Posted 2024-10-17 03:52:08

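I've been using x86 SIMD instructions (SSE1234) in the form of intrinsics quite a lot lately. What I find frustrating is that the SSE ISA has several simple instructions that are available only for floats or only for integers, even though in theory they should perform the same for both. For example, both float and double vectors have instructions to load the high 64 bits of a 128-bit vector from an address (movhps, movhpd), but there is no such instruction for integer vectors.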

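My question:

Is there any reason to expect a performance hit when using floating-point instructions on integer vectors, e.g. using movhps to load data into an integer vector?

I wrote several tests to check this, but I don't think their results are trustworthy. It is really hard to write a correct test that explores all the corner cases for this kind of thing, especially since instruction scheduling is most likely involved.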
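To make the movhps case concrete, here is a minimal sketch in C intrinsics. The helper names `load_high64_movhps` and `load_high64_int` are hypothetical, chosen only for this illustration. The first goes through the float-domain load `_mm_loadh_pi` (MOVHPS) using casts, which are type reinterpretations and emit no instructions; the second gets the same result with integer-domain instructions only.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE1 header */
#include <stdint.h>

/* Replace the upper 64 bits of an integer vector with 8 bytes read from p,
   using the float-domain MOVHPS. The casts only change the C type. */
static __m128i load_high64_movhps(__m128i v, const void *p)
{
    __m128 f = _mm_castsi128_ps(v);
    f = _mm_loadh_pi(f, (const __m64 *)p);
    return _mm_castps_si128(f);
}

/* Integer-domain alternative: a 64-bit MOVQ load into the low half of a
   temporary, then PUNPCKLQDQ to combine low(v) with low(tmp). */
static __m128i load_high64_int(__m128i v, const void *p)
{
    __m128i hi = _mm_loadl_epi64((const __m128i *)p);
    return _mm_unpacklo_epi64(v, hi);
}
```

Both functions return the same bits. The integer-only version costs one extra instruction, so which one is faster depends on whether the target CPU charges a bypass delay for the domain crossing.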

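Related question:

Other trivially similar operations also have several instructions that do basically the same thing. For example, I can do a bitwise OR with por, orps or orpd. Can anyone explain what the purpose of these additional instructions is? I guess it might be related to different scheduling algorithms applied to each instruction.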
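The three OR variants can be compared directly. The sketch below (the helper names `or_int`, `or_ps`, `or_pd` and `same_bits` are hypothetical) requests each form through its intrinsic and checks that the resulting bit patterns are identical. Note that a compiler is free to substitute one encoding for another, so the instruction actually emitted should be confirmed in the disassembly.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Bitwise OR requested as POR (integer domain). */
static __m128i or_int(__m128i a, __m128i b)
{
    return _mm_or_si128(a, b);
}

/* Bitwise OR requested as ORPS (single-precision float domain). */
static __m128i or_ps(__m128i a, __m128i b)
{
    return _mm_castps_si128(
        _mm_or_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
}

/* Bitwise OR requested as ORPD (double-precision float domain). */
static __m128i or_pd(__m128i a, __m128i b)
{
    return _mm_castpd_si128(
        _mm_or_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b)));
}

/* Returns 1 when all 16 bytes of x and y are equal, 0 otherwise. */
static int same_bits(__m128i x, __m128i y)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) == 0xFFFF;
}
```

Because the three forms are functionally interchangeable, the choice between them only matters for code size and for which execution domain the result lands in.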

不必了 2024-10-24 03:52:08

From an expert (obviously not me :P): http://www.agner.org/optimize/optimizing_assembly.pdf [13.2 Using vector instructions with other types of data than they are intended for (pages 118-119)]:

There is a penalty for using the wrong type of instructions on some processors. This is
because the processor may have different data buses or different execution units for integer
and floating point data. Moving data between the integer and floating point units can take
one or more clock cycles depending on the processor, as listed in table 13.2.
Processor                       Bypass delay, clock cycles 
  Intel Core 2 and earlier        1 
  Intel Nehalem                   2 
  Intel Sandy Bridge and later    0-1 
  Intel Atom                      0 
  AMD                             2 
  VIA Nano                        2-3 
Table 13.2. Data bypass delays between integer and floating point execution units 
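One way to look for such a bypass delay on a given machine is to time a long dependent chain in which every operation feeds the next, once with integer-only instructions and once with a float-domain OR in the middle. The following is a rough sketch under several assumptions: the function names are hypothetical, `clock()` is a coarse timer, and the compiler may replace ORPS with POR (or otherwise transform the loop), so the disassembly should be checked before drawing any conclusion from the timings.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <time.h>

/* Dependent chain PADDD -> POR -> PADDD -> ...: stays in the integer domain. */
static __m128i chain_int(__m128i x, __m128i m, long iters)
{
    for (long i = 0; i < iters; ++i) {
        x = _mm_add_epi32(x, m);
        x = _mm_or_si128(x, m);
    }
    return x;
}

/* Same arithmetic, but the OR is requested as ORPS, so each iteration may
   cross between the integer and floating-point domains twice. */
static __m128i chain_mixed(__m128i x, __m128i m, long iters)
{
    for (long i = 0; i < iters; ++i) {
        x = _mm_add_epi32(x, m);
        x = _mm_castps_si128(
            _mm_or_ps(_mm_castsi128_ps(x), _mm_castsi128_ps(m)));
    }
    return x;
}

typedef __m128i (*chain_fn)(__m128i, __m128i, long);

/* Runs one chain and returns the elapsed CPU time in seconds. The result is
   written to *out so the work cannot be discarded as dead code. */
static double time_chain(chain_fn f, long iters, __m128i *out)
{
    clock_t t0 = clock();
    *out = f(_mm_set1_epi32(1), _mm_set1_epi32(3), iters);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling `time_chain(chain_int, n, &r)` and `time_chain(chain_mixed, n, &r)` with a large `n` and comparing the two times gives a first impression. If the table above applies to the CPU under test, the mixed chain would be expected to take longer per iteration on the processors listed with a nonzero delay.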
