Optimizing SSE code

Posted on 2024-12-10 12:02:11

I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster than the Java version (~20%). However, it's still not fast enough.

Unfortunately, my experience with optimizing C code is somewhat limited. I would therefore love to get some ideas on how to improve the current implementation.

The inner loop that constitutes the hot spot looks like this:

for (i = 0; i < numberOfGFVectorsInFragment; i++)   {

        // Load the 4 GF-elements from the message fragment and add the log
        // of the coefficient to them.
        __m128i currentMessageFragmentVector = _mm_load_si128(currentMessageFragmentPtr);
        __m128i currentEncodedResult = _mm_load_si128(encodedFragmentResultArray);

        __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, currentMessageFragmentVector);

        // valuesToXor holds the four expTable entries indexed by the lanes of
        // logSumVector; the gather/pack step that produces it is omitted from
        // this snippet (see the first answer below).
        __m128i updatedResultVector = _mm_xor_si128(currentEncodedResult, valuesToXor);
        _mm_store_si128(encodedFragmentResultArray, updatedResultVector);

        encodedFragmentResultArray++;
        currentMessageFragmentPtr++;
    }
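
For context, here is a minimal sketch of how the loop presumably reads once the omitted gather step is filled in. The first answer below indicates the original packs four expTable lookups with _mm_set_epi32; the function wrapper, the spill buffer, and the parameter types are illustrative rather than the asker's exact code:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    static void encodeFragmentSSE2(const __m128i *currentMessageFragmentPtr,
                                   __m128i *encodedFragmentResultArray,
                                   __m128i coefficientLogValueVector,
                                   const int *expTable,
                                   int numberOfGFVectorsInFragment)
    {
        int i;
        for (i = 0; i < numberOfGFVectorsInFragment; i++) {
            __m128i currentMessageFragmentVector = _mm_load_si128(currentMessageFragmentPtr);
            __m128i currentEncodedResult = _mm_load_si128(encodedFragmentResultArray);

            /* Add the coefficient's log to the four GF-element logs. */
            __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector,
                                                 currentMessageFragmentVector);

            /* Gather: look the four sums up in expTable and repack them.
               Spilling to a small array is just one way to read the lanes;
               only the _mm_set_epi32 packing is confirmed by the answer. */
            int logSum[4];
            _mm_storeu_si128((__m128i *)logSum, logSumVector);
            __m128i valuesToXor = _mm_set_epi32(expTable[logSum[3]], expTable[logSum[2]],
                                                expTable[logSum[1]], expTable[logSum[0]]);

            /* GF addition is XOR; accumulate into the encoded fragment. */
            __m128i updatedResultVector = _mm_xor_si128(currentEncodedResult, valuesToXor);
            _mm_store_si128(encodedFragmentResultArray, updatedResultVector);

            encodedFragmentResultArray++;
            currentMessageFragmentPtr++;
        }
    }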

Comments (2)

狠疯拽 2024-12-17 12:02:11

Even without looking at the assembly, I can tell right away that the bottleneck is the 4-element gather memory access and the _mm_set_epi32 packing operations. Internally, _mm_set_epi32 will, in your case, probably be implemented as a series of unpacklo/hi instructions.

Most of the "work" in this loop comes from packing those 4 memory accesses. In the absence of SSE4.1, I would go so far as to say the loop could be faster non-vectorized, but unrolled.
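
A minimal sketch of that scalar-but-unrolled alternative, assuming each GF element occupies one 32-bit slot (as the _mm_add_epi32 in the question implies) and that coefficientLogValue is the scalar log of the coefficient; the names here are illustrative, not the asker's:

    /* Scalar, unrolled-by-4 variant: four independent table lookups per
       iteration, so the loads can overlap. Assumes the element count is a
       multiple of 4. */
    void encodeFragmentScalar(const int *messageFragment,
                              int *encodedFragmentResult,
                              int coefficientLogValue,
                              const int *expTable,
                              int numberOfGFElements)
    {
        int i;
        for (i = 0; i < numberOfGFElements; i += 4) {
            encodedFragmentResult[i + 0] ^= expTable[coefficientLogValue + messageFragment[i + 0]];
            encodedFragmentResult[i + 1] ^= expTable[coefficientLogValue + messageFragment[i + 1]];
            encodedFragmentResult[i + 2] ^= expTable[coefficientLogValue + messageFragment[i + 2]];
            encodedFragmentResult[i + 3] ^= expTable[coefficientLogValue + messageFragment[i + 3]];
        }
    }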

If you're willing to use SSE4.1, you can try this. It might be faster, it might not:

    /* Read the four log sums back as scalars, then pack the expTable
       lookups with SSE4.1 insert instructions. */
    int* logSumArray = (int*)(&logSumVector);

    __m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 2);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 3);

I suggest unrolling the loop by at least 4 iterations and interleaving all the instructions to give this code any chance of performing well.
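
A sketch of that unroll-and-interleave idea, shown here with two iterations per pass so the two gather/pack chains are independent and their latencies can overlap (extend to four in the same fashion); it reuses the question's loop variables, requires SSE4.1 for _mm_extract_epi32/_mm_insert_epi32, and assumes an even vector count:

    /* Two independent gather/pack chains per pass; extend to 4 for more ILP. */
    for (i = 0; i + 1 < numberOfGFVectorsInFragment; i += 2) {
        __m128i logSumA = _mm_add_epi32(coefficientLogValueVector,
                                        _mm_load_si128(currentMessageFragmentPtr));
        __m128i logSumB = _mm_add_epi32(coefficientLogValueVector,
                                        _mm_load_si128(currentMessageFragmentPtr + 1));

        __m128i xorA = _mm_cvtsi32_si128(expTable[_mm_extract_epi32(logSumA, 0)]);
        __m128i xorB = _mm_cvtsi32_si128(expTable[_mm_extract_epi32(logSumB, 0)]);
        xorA = _mm_insert_epi32(xorA, expTable[_mm_extract_epi32(logSumA, 1)], 1);
        xorB = _mm_insert_epi32(xorB, expTable[_mm_extract_epi32(logSumB, 1)], 1);
        xorA = _mm_insert_epi32(xorA, expTable[_mm_extract_epi32(logSumA, 2)], 2);
        xorB = _mm_insert_epi32(xorB, expTable[_mm_extract_epi32(logSumB, 2)], 2);
        xorA = _mm_insert_epi32(xorA, expTable[_mm_extract_epi32(logSumA, 3)], 3);
        xorB = _mm_insert_epi32(xorB, expTable[_mm_extract_epi32(logSumB, 3)], 3);

        _mm_store_si128(encodedFragmentResultArray,
                        _mm_xor_si128(_mm_load_si128(encodedFragmentResultArray), xorA));
        _mm_store_si128(encodedFragmentResultArray + 1,
                        _mm_xor_si128(_mm_load_si128(encodedFragmentResultArray + 1), xorB));

        encodedFragmentResultArray += 2;
        currentMessageFragmentPtr += 2;
    }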

What you really need is Intel's AVX2 gather/scatter instructions. But that's a few years down the road...
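
With AVX2 hardware gather (available on newer CPUs), the whole scalar gather/pack collapses into a single intrinsic. A minimal sketch, assuming expTable holds 32-bit entries as in the snippets above; requires <immintrin.h> and an AVX2-capable build:

    /* AVX2 hardware gather: fetch the four expTable entries indexed by the
       lanes of logSumVector in one instruction (scale 4 = sizeof(int)). */
    __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector,
                                         currentMessageFragmentVector);
    __m128i valuesToXor  = _mm_i32gather_epi32(expTable, logSumVector, 4);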

流年已逝 2024-12-17 12:02:11

Maybe try http://web.eecs.utk.edu/~plank/plank/papers/CS-07-593/.
The functions with "region" in their names are supposedly fast. They don't seem to use any kind of special instruction set, but maybe they've been optimized in other ways...
