Optimizing SSE code
I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster than the Java version (~20%). However, it's still not fast enough.
Unfortunately, my experience with optimizing C code is somewhat limited, so I would love to get some ideas on how to improve the current implementation.
The inner loop that constitutes the hot spot looks like this:
for (i = 0; i < numberOfGFVectorsInFragment; i++) {
    // Load the 4 GF elements from the message fragment and add the log of the coefficient to them.
    __m128i currentMessageFragmentVector = _mm_load_si128(currentMessageFragmentPtr);
    __m128i currentEncodedResult = _mm_load_si128(encodedFragmentResultArray);
    __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, currentMessageFragmentVector);

    // (The exp-table lookup that packs the four multiplication results into
    // valuesToXor via _mm_set_epi32 is elided in this copy of the post.)
    __m128i updatedResultVector = _mm_xor_si128(currentEncodedResult, valuesToXor);
    _mm_store_si128(encodedFragmentResultArray, updatedResultVector);

    encodedFragmentResultArray++;
    currentMessageFragmentPtr++;
}
Even without looking at the assembly, I can tell right away that the bottleneck is the 4-element gather memory access and the _mm_set_epi32 packing operations. Internally, _mm_set_epi32 will, in your case, probably be implemented as a series of unpacklo/hi instructions. Most of the "work" in this loop comes from packing those 4 memory accesses. In the absence of SSE4.1, I would go so far as to say that the loop could be faster non-vectorized, but unrolled.
If you're willing to use SSE4.1, you can give it a try. It might be faster, it might not.
I suggest unrolling the loop by at least 4 iterations and interleaving all the instructions to give this code any chance of performing well.
What you really need is Intel's AVX2 gather/scatter instructions. But that's a few years down the road...
Maybe try http://web.eecs.utk.edu/~plank/plank/papers/CS-07-593/.
The functions with "region" in their names are supposedly fast. They don't seem to use any kind of special instruction sets, but maybe they've been optimized in other ways...