Is SSE slower than the FPU?

Posted on 2024-12-26 11:56:12


I have a large piece of code, part of whose body contains this expression:

result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);

which I have vectorized as follows (everything is already a float):

__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                      _mm_set_ps(ny, nx, m_Ly, m_Lx));
__declspec(align(16)) int asInt[4] = {
    _mm_extract_ps(r,0), _mm_extract_ps(r,1),
    _mm_extract_ps(r,2), _mm_extract_ps(r,3)
};
float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);

The result is correct; however, my benchmarking shows that the vectorized version is slower:

  • The non-vectorized version takes 3750 ms
  • The vectorized version takes 4050 ms
  • Setting result to 0 directly (and removing this part of the code entirely) reduces the entire process to 2500 ms

Given that the vectorized version only contains one set of SSE multiplications (instead of four individual FPU multiplications), why is it slower? Is the FPU indeed faster than SSE, or is there a confounding variable here?

(I'm on a mobile Core i5.)


Comments (3)

千秋岁 2025-01-02 11:56:12

You are spending a lot of time moving scalar values to/from SSE registers with _mm_set_ps and _mm_extract_ps - this is generating a lot of instructions, the execution time of which will far outweigh any benefit from using _mm_mul_ps. Take a look at the generated assembly output to see how much code is being generated in addition to the single MULPS instruction.

To vectorize this properly you need to use 128-bit SSE loads and stores (_mm_load_ps/_mm_store_ps) and then use SSE shuffle instructions to move elements around within registers where needed.
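
As a rough sketch of the shuffle part of this advice (my own illustration, not the original code, reusing the question's variable names), the horizontal sums can be done with _mm_movehl_ps and _mm_shuffle_ps so the products never leave the XMM registers:

#include <xmmintrin.h>  // SSE: _mm_set_ps, _mm_mul_ps, _mm_movehl_ps, _mm_shuffle_ps, _mm_add_ss, _mm_cvtss_f32
#include <cmath>

float ComputeResult(float nx, float ny, float m_Lx, float m_Ly, float m_Lz)
{
    // r = { nx*m_Lx, ny*m_Ly, nx*nx, ny*ny }  (lane 0 is the last _mm_set_ps argument)
    __m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                          _mm_set_ps(ny, nx, m_Ly, m_Lx));

    // move the two squared terms (lanes 2 and 3) down into lanes 0 and 1
    __m128 hi = _mm_movehl_ps(r, r);

    // lane 0 of num = nx*m_Lx + ny*m_Ly, lane 0 of den = nx*nx + ny*ny
    __m128 num = _mm_add_ss(r,  _mm_shuffle_ps(r,  r,  _MM_SHUFFLE(3, 2, 1, 1)));
    __m128 den = _mm_add_ss(hi, _mm_shuffle_ps(hi, hi, _MM_SHUFFLE(3, 2, 1, 1)));

    return (_mm_cvtss_f32(num) + m_Lz) / std::sqrt(_mm_cvtss_f32(den) + 1.0f);
}

This avoids the extract/reload round trip, but the _mm_set_ps packing is likely still the dominant cost, so the bigger win usually comes from processing several (nx, ny) pairs per iteration.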

One further point to note: modern CPUs such as the Core i5 and Core i7 have two scalar FPUs and can issue 2 floating-point multiplies per clock. The potential benefit from SSE for single-precision floating point is therefore only 2x at best. It's easy to lose most or all of this 2x benefit if you have excessive "housekeeping" instructions, as is the case here.

人│生佛魔见 2025-01-02 11:56:12

There are several problems:

  1. You will not see much benefit from using SSE for an operation like this, because SSE instructions are meant for parallel operations, that is, operating on several values at the same time. What you did is a misuse of SSE.
  2. Do not set the values one by one; use a pointer to the first value in an array instead. But then, your values are not in an array.
  3. Do not extract the values and copy them into an array afterwards; that is also a misuse of SSE. The results are supposed to already be in an array (see the sketch after this list).
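
For what it's worth, a minimal sketch of that array-oriented approach (my own illustration, not the original code; the function signature and the assumption that count is a multiple of 4 are mine) could look like this:

#include <xmmintrin.h>  // SSE: _mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_sqrt_ps, _mm_div_ps, _mm_storeu_ps

// nx, ny and result point to arrays of `count` floats; count is assumed to be a multiple of 4
void ComputeBatch(const float* nx, const float* ny, float* result, int count,
                  float m_Lx, float m_Ly, float m_Lz)
{
    const __m128 vLx = _mm_set1_ps(m_Lx);
    const __m128 vLy = _mm_set1_ps(m_Ly);
    const __m128 vLz = _mm_set1_ps(m_Lz);
    const __m128 one = _mm_set1_ps(1.0f);

    for (int i = 0; i < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(nx + i);   // four nx values at once
        __m128 y = _mm_loadu_ps(ny + i);   // four ny values at once

        // numerator: nx*m_Lx + ny*m_Ly + m_Lz, in all four lanes
        __m128 num = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, vLx), _mm_mul_ps(y, vLy)), vLz);
        // denominator: sqrt(nx*nx + ny*ny + 1), in all four lanes
        __m128 den = _mm_sqrt_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)), one));

        _mm_storeu_ps(result + i, _mm_div_ps(num, den));   // four results per loop iteration
    }
}
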
萝莉病 2025-01-02 11:56:12

My take would be that, when using the FPU, the processor has time to compute the first multiplication while it is still loading the next values, whereas the SSE version has to load all the values first.
