How to optimize C code with SSE intrinsics for packed 32x32 => 64-bit multiplies, and unpacking the halves of those results for (Galois fields)
I've been struggling for a while with the performance of the network coding in an application I'm developing (see Optimizing SSE-code, Improving performance of network coding-encoding and OpenCL distribution). Now I'm quite close to achieving acceptable performance. This is the current state of the innermost loop (which is where >99% of the execution time is spent):
    while(elementIterations-- >0)
    {
        unsigned int firstMessageField = *(currentMessageGaloisFieldsArray++);
        unsigned int secondMessageField = *(currentMessageGaloisFieldsArray++);
        __m128i valuesToMultiply = _mm_set_epi32(0, secondMessageField, 0, firstMessageField);
        __m128i mulitpliedHalves = _mm_mul_epu32(valuesToMultiply, fragmentCoefficentVector);
    }
Do you have any suggestions on how to optimize this further? I understand that it's hard to do without more context, but any help is appreciated!
Now that I'm awake, here's my answer:
In your original code, the bottleneck is almost certainly _mm_set_epi32. This single intrinsic gets compiled into this mess in your assembly:
What is this? 9 instructions?!?!?! Pure overhead...
Another place that seems odd is that the compiler didn't merge the adds and loads:
should have been merged into:
I'm not sure if the compiler went brain-dead, or if it actually had a legitimate reason to do that... Anyways, it's a small thing compared to the _mm_set_epi32.
Disclaimer: The code I will present from here on violates strict-aliasing. But non-standard compliant methods are often needed to achieve maximum performance.
Solution 1: No Vectorization
This solution assumes allZero is really all zeros.
The loop is actually simpler than it looks. Since there isn't a lot of arithmetic, it might be better to just not vectorize:
Which compiles to this on x64:
and this on x86:
It's possible that both of these are already faster than your original (SSE) code... On x64, unrolling it will make it even better.
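For illustration, here is a minimal sketch of what a non-vectorized version of the multiply could look like. It is not the code from this answer. The function name, the output arrays, and the use of a single scalar coefficient are all assumptions made for the example, and the high/low split is inferred from the question title.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): multiply each 32-bit
 * field by one 32-bit coefficient and split the 64-bit product into its
 * low and high 32-bit halves. */
void multiply_fields_scalar(const uint32_t *fields, size_t count,
                            uint32_t coefficient,
                            uint32_t *low_halves, uint32_t *high_halves)
{
    for (size_t i = 0; i < count; ++i) {
        /* Widen one operand first so the multiply is carried out in 64 bits. */
        uint64_t product = (uint64_t)fields[i] * coefficient;
        low_halves[i]  = (uint32_t)product;          /* bits 0..31  */
        high_halves[i] = (uint32_t)(product >> 32);  /* bits 32..63 */
    }
}
```

The cast to uint64_t before the multiply is the important part. Without it the product would be truncated to 32 bits and the high half would be lost.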
Solution 2: SSE2 Integer Shuffle
This solution unrolls the loop to 2 iterations:
which gets compiled to this (x86): (x64 isn't too different)
Only slightly longer than the non-vectorized version for two iterations. This uses very few registers, so you can further unroll this even on x86.
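As a rough sketch of the shuffle idea (again, not this answer's exact code), the version below loads four fields at once and uses _mm_shuffle_epi32 to move them into the even 32-bit elements that _mm_mul_epu32 reads. The function name, the single scalar coefficient, the output layout, and the requirement that count be a multiple of 4 are assumptions for the example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: products[i] = (uint64_t)fields[i] * coefficient.
 * Assumes count is a multiple of 4; pointers may be unaligned. */
void multiply_fields_shuffle(const uint32_t *fields, size_t count,
                             uint32_t coefficient, uint64_t *products)
{
    /* _mm_mul_epu32 reads only 32-bit elements 0 and 2 of each operand,
     * so broadcasting the coefficient to all four elements is sufficient. */
    const __m128i coeff = _mm_set1_epi32((int)coefficient);

    for (size_t i = 0; i < count; i += 4) {
        __m128i four = _mm_loadu_si128((const __m128i *)(fields + i));

        /* Result elements: [f0, f0, f1, f1] and [f2, f2, f3, f3].
         * Elements 1 and 3 are ignored by the multiply. */
        __m128i pair01 = _mm_shuffle_epi32(four, _MM_SHUFFLE(1, 1, 0, 0));
        __m128i pair23 = _mm_shuffle_epi32(four, _MM_SHUFFLE(3, 3, 2, 2));

        /* Each multiply produces two full 64-bit products. */
        _mm_storeu_si128((__m128i *)(products + i),
                         _mm_mul_epu32(pair01, coeff));
        _mm_storeu_si128((__m128i *)(products + i + 2),
                         _mm_mul_epu32(pair23, coeff));
    }
}
```

If the input is known to be 16-byte aligned, _mm_load_si128 can be used in place of the unaligned load.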
Explanations:
_mm_set_epi32 (which gets compiled into about 9 instructions in your original code) can be replaced with a single _mm_shuffle_epi32.
I suggest you unroll your loop by a factor of 2 so that you can load 4 messageField values using one _mm_load_XXX, then unpack these four values into two vector pairs and process them as per the current loop. That way the compiler won't generate a lot of messy code for _mm_set_epi32, and all your loads and stores will be 128-bit SSE loads/stores. This will also give the compiler more opportunity to schedule instructions optimally within the loop.
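One possible way to do the unpacking step is to interleave the loaded vector with zeros, which reproduces the (0, second, 0, first) layout that _mm_set_epi32 was building. This is only a sketch under assumptions: the function name, the scalar coefficient, the output array, and a count that is a multiple of 4 are not from the original posts.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: products[i] = (uint64_t)fields[i] * coefficient.
 * Assumes count is a multiple of 4; pointers may be unaligned. */
void multiply_fields_unpack(const uint32_t *fields, size_t count,
                            uint32_t coefficient, uint64_t *products)
{
    const __m128i zero  = _mm_setzero_si128();
    const __m128i coeff = _mm_set1_epi32((int)coefficient);

    for (size_t i = 0; i < count; i += 4) {
        /* One 128-bit load fetches four message fields. */
        __m128i four = _mm_loadu_si128((const __m128i *)(fields + i));

        /* Interleave with zeros: [f0, 0, f1, 0] and [f2, 0, f3, 0]. */
        __m128i pair01 = _mm_unpacklo_epi32(four, zero);
        __m128i pair23 = _mm_unpackhi_epi32(four, zero);

        /* 32x32 => 64-bit multiply on elements 0 and 2 of each vector. */
        _mm_storeu_si128((__m128i *)(products + i),
                         _mm_mul_epu32(pair01, coeff));
        _mm_storeu_si128((__m128i *)(products + i + 2),
                         _mm_mul_epu32(pair23, coeff));
    }
}
```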