Is SSE slower than the FPU?

Posted on 2024-12-26 11:56:12


I have a large piece of code, part of whose body contains this expression:

result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);

which I have vectorized as follows (everything is already a float):

__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                      _mm_set_ps(ny, nx, m_Ly, m_Lx));
__declspec(align(16)) int asInt[4] = {
    _mm_extract_ps(r,0), _mm_extract_ps(r,1),
    _mm_extract_ps(r,2), _mm_extract_ps(r,3)
};
float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);

The result is correct; however, my benchmarking shows that the vectorized version is slower:

  • The non-vectorized version takes 3750 ms
  • The vectorized version takes 4050 ms
  • Setting result to 0 directly (and removing this part of the code entirely) reduces the entire process to 2500 ms

Given that the vectorized version only contains one set of SSE multiplications (instead of four individual FPU multiplications), why is it slower? Is the FPU indeed faster than SSE, or is there a confounding variable here?

(I'm on a mobile Core i5.)


Comments (3)

千秋岁 2025-01-02 11:56:12

You are spending a lot of time moving scalar values to/from SSE registers with _mm_set_ps and _mm_extract_ps - this is generating a lot of instructions, the execution time of which will far outweigh any benefit from using _mm_mul_ps. Take a look at the generated assembly output to see how much code is being generated in addition to the single MULPS instruction.

To vectorize this properly you need to use 128-bit SSE loads and stores (_mm_load_ps/_mm_store_ps) and then use SSE shuffle instructions to move elements around within registers where needed.
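
As a rough sketch of the shuffle part of this advice (my own illustration, not the original code, reusing the question's variable names), the horizontal sums can be done with _mm_movehl_ps and _mm_shuffle_ps so the products never leave the XMM registers:

#include <xmmintrin.h>  // SSE: _mm_set_ps, _mm_mul_ps, _mm_movehl_ps, _mm_shuffle_ps, _mm_add_ss, _mm_cvtss_f32
#include <cmath>

float ComputeResult(float nx, float ny, float m_Lx, float m_Ly, float m_Lz)
{
    // r = { nx*m_Lx, ny*m_Ly, nx*nx, ny*ny }  (lane 0 is the last _mm_set_ps argument)
    __m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                          _mm_set_ps(ny, nx, m_Ly, m_Lx));

    // move the two squared terms (lanes 2 and 3) down into lanes 0 and 1
    __m128 hi = _mm_movehl_ps(r, r);

    // lane 0 of num = nx*m_Lx + ny*m_Ly, lane 0 of den = nx*nx + ny*ny
    __m128 num = _mm_add_ss(r,  _mm_shuffle_ps(r,  r,  _MM_SHUFFLE(3, 2, 1, 1)));
    __m128 den = _mm_add_ss(hi, _mm_shuffle_ps(hi, hi, _MM_SHUFFLE(3, 2, 1, 1)));

    return (_mm_cvtss_f32(num) + m_Lz) / std::sqrt(_mm_cvtss_f32(den) + 1.0f);
}

This avoids the extract/reload round trip, but the _mm_set_ps packing is likely still the dominant cost, so the bigger win usually comes from processing several (nx, ny) pairs per iteration.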

One further point to note: modern CPUs such as the Core i5 and Core i7 have two scalar FPUs and can issue 2 floating-point multiplies per clock. The potential benefit from SSE for single-precision floating point is therefore only 2x at best. It's easy to lose most or all of this 2x benefit if you have excessive "housekeeping" instructions, as is the case here.

人│生佛魔见 2025-01-02 11:56:12

There are several problems:

  1. You will not see much benefit from using SSE for an operation like this, because SSE instructions are meant for parallel operations, that is, operating on several values at the same time. What you did is a misuse of SSE.
  2. Do not set the values one by one; use a pointer to the first value in an array instead. But then, your values are not in an array.
  3. Do not extract the values and copy them into an array afterwards; that is also a misuse of SSE. The results are supposed to already be in an array (see the sketch after this list).
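
For what it's worth, a minimal sketch of that array-oriented approach (my own illustration, not the original code; the function signature and the assumption that count is a multiple of 4 are mine) could look like this:

#include <xmmintrin.h>  // SSE: _mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_sqrt_ps, _mm_div_ps, _mm_storeu_ps

// nx, ny and result point to arrays of `count` floats; count is assumed to be a multiple of 4
void ComputeBatch(const float* nx, const float* ny, float* result, int count,
                  float m_Lx, float m_Ly, float m_Lz)
{
    const __m128 vLx = _mm_set1_ps(m_Lx);
    const __m128 vLy = _mm_set1_ps(m_Ly);
    const __m128 vLz = _mm_set1_ps(m_Lz);
    const __m128 one = _mm_set1_ps(1.0f);

    for (int i = 0; i < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(nx + i);   // four nx values at once
        __m128 y = _mm_loadu_ps(ny + i);   // four ny values at once

        // numerator: nx*m_Lx + ny*m_Ly + m_Lz, in all four lanes
        __m128 num = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, vLx), _mm_mul_ps(y, vLy)), vLz);
        // denominator: sqrt(nx*nx + ny*ny + 1), in all four lanes
        __m128 den = _mm_sqrt_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)), one));

        _mm_storeu_ps(result + i, _mm_div_ps(num, den));   // four results per loop iteration
    }
}
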
萝莉病 2025-01-02 11:56:12

My take would be that, when using the FPU, the processor has time to compute the first multiplication while it is still loading the next values, whereas the SSE version has to load all the values first.
