Is SSE slower than the FPU?
I have a large piece of code, part of whose body contains this snippet:
result = (nx * m_Lx + ny * m_Ly + m_Lz) / sqrt(nx * nx + ny * ny + 1);
which I have vectorized as follows (everything is already a float):
// product lanes, low to high: { nx*m_Lx, ny*m_Ly, nx*nx, ny*ny }
__m128 r = _mm_mul_ps(_mm_set_ps(ny, nx, ny, nx),
                      _mm_set_ps(ny, nx, m_Ly, m_Lx));
// _mm_extract_ps (SSE4.1) returns each lane's bit pattern as an int
__declspec(align(16)) int asInt[4] = {
    _mm_extract_ps(r, 0), _mm_extract_ps(r, 1),
    _mm_extract_ps(r, 2), _mm_extract_ps(r, 3)
};
float (&res)[4] = reinterpret_cast<float (&)[4]>(asInt);
result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);
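As an aside, since _mm_extract_ps just returns each lane's bit pattern as an int, the same unpacking could presumably also be written as a single aligned store (a hypothetical variant, not what I benchmarked):

__declspec(align(16)) float res[4];
_mm_store_ps(res, r);  // writes all four lanes at once
result = (res[0] + res[1] + m_Lz) / sqrt(res[2] + res[3] + 1);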
The result is correct; however, my benchmarking shows that the vectorized version is slower:
- The non-vectorized version takes 3750 ms
- The vectorized version takes 4050 ms
- Setting result to 0 directly (and removing this part of the code entirely) reduces the entire process to 2500 ms
Given that the vectorized version contains only a single packed SSE multiply (instead of four separate FPU multiplies), why is it slower? Is the FPU really faster than SSE, or is there a confounding variable here?
(I'm on a mobile Core i5.)
Comments (3)
You are spending a lot of time moving scalar values to/from SSE registers with _mm_set_ps and _mm_extract_ps. This generates a lot of instructions, whose execution time will far outweigh any benefit from using _mm_mul_ps. Take a look at the generated assembly output to see how much code is emitted in addition to the single MULPS instruction.

To vectorize this properly you need to use 128-bit SSE loads and stores (_mm_load_ps / _mm_store_ps) and then use SSE shuffle instructions to move elements around within registers where needed (see the sketch below).

One further point to note: modern CPUs such as the Core i5 and Core i7 have two scalar FPUs and can issue two floating-point multiplies per clock, so the potential benefit of SSE for single-precision floating point is only 2x at best. It is easy to lose most or all of that 2x benefit if you have excessive "housekeeping" instructions, as is the case here.
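To make that concrete, here is a minimal sketch of the load/store approach. The function name and data layout are assumptions, not from the question: it supposes the nx/ny inputs sit in separate 16-byte-aligned arrays (a structure-of-arrays layout), which sidesteps shuffles entirely and computes four results per call with no per-element set/extract at all:

#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical sketch: compute four results at once from SoA inputs.
// Assumes nx[], ny[] and out[] are 16-byte aligned and hold >= 4 floats.
void compute4(const float* nx, const float* ny, float* out,
              float m_Lx, float m_Ly, float m_Lz)
{
    __m128 vnx = _mm_load_ps(nx);      // four nx values in one load
    __m128 vny = _mm_load_ps(ny);      // four ny values in one load
    __m128 vLx = _mm_set1_ps(m_Lx);    // broadcast the constants once
    __m128 vLy = _mm_set1_ps(m_Ly);
    __m128 vLz = _mm_set1_ps(m_Lz);
    __m128 one = _mm_set1_ps(1.0f);

    // numerator: nx*m_Lx + ny*m_Ly + m_Lz, four lanes at a time
    __m128 num = _mm_add_ps(_mm_add_ps(_mm_mul_ps(vnx, vLx),
                                       _mm_mul_ps(vny, vLy)), vLz);
    // denominator: sqrt(nx*nx + ny*ny + 1)
    __m128 den = _mm_sqrt_ps(_mm_add_ps(_mm_add_ps(_mm_mul_ps(vnx, vnx),
                                                   _mm_mul_ps(vny, vny)), one));

    _mm_store_ps(out, _mm_div_ps(num, den));  // four results, one store
}

If the data cannot be rearranged into this layout, in-register shuffles (_mm_shuffle_ps) are the fallback, but the SoA layout is what actually makes the vectorization pay off.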
There are several problems:
My take would be that, when using the FPU, the processor has time to compute the first multiplication while it is still loading the next values, whereas the SSE version has to load all of the values first.