Most efficient way to store 4 dot products into a contiguous array in C using SSE intrinsics

Posted 2024-10-01 17:27:44

I am optimizing some code for an Intel x86 Nehalem micro-architecture using SSE intrinsics.

A portion of my program computes 4 dot products and adds each result to the previous values in a contiguous chunk of an array. More specifically,

tmp0 = _mm_dp_ps(A_0m, B_0m, 0xF1);  // high nibble 0xF: multiply all 4 elements; low nibble: result -> lane 0
tmp1 = _mm_dp_ps(A_1m, B_0m, 0xF2);  // result -> lane 1, remaining lanes zeroed
tmp2 = _mm_dp_ps(A_2m, B_0m, 0xF4);  // result -> lane 2
tmp3 = _mm_dp_ps(A_3m, B_0m, 0xF8);  // result -> lane 3

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
tmp0 = _mm_add_ps(tmp0, C_0n);

_mm_storeu_ps(C_2, tmp0);

Notice that I am using 4 temporary xmm registers to hold the results of the dot products. Each result is placed into a different 32-bit lane of its register, so that the end result looks like this:

tmp0 = R0-zero-zero-zero
tmp1 = zero-R1-zero-zero
tmp2 = zero-zero-R2-zero
tmp3 = zero-zero-zero-R3

I combine the values contained in each tmp variable into one xmm variable by summing them up with the following instructions:

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);

Finally, I add the register containing all 4 dot-product results to a contiguous part of the array, so that each array element is incremented by one dot product (C_0n holds the 4 values currently in the array that are to be updated; C_2 is the address of those 4 values):

tmp0 = _mm_add_ps(tmp0, C_0n);
_mm_storeu_ps(C_2, tmp0);
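
For reference, here is the sequence above wrapped in a self-contained function (a minimal sketch; the function name, signature, and the explicit load of C_0n are my own):

#include <smmintrin.h>  // SSE4.1, for _mm_dp_ps

// Adds dot(A_im, B_0m), i = 0..3, to the four floats starting at C_2.
static void add_four_dots(__m128 A_0m, __m128 A_1m, __m128 A_2m,
                          __m128 A_3m, __m128 B_0m, float *C_2)
{
    __m128 tmp0 = _mm_dp_ps(A_0m, B_0m, 0xF1);  // dot -> lane 0
    __m128 tmp1 = _mm_dp_ps(A_1m, B_0m, 0xF2);  // dot -> lane 1
    __m128 tmp2 = _mm_dp_ps(A_2m, B_0m, 0xF4);  // dot -> lane 2
    __m128 tmp3 = _mm_dp_ps(A_3m, B_0m, 0xF8);  // dot -> lane 3

    __m128 C_0n = _mm_loadu_ps(C_2);            // previous values

    tmp0 = _mm_add_ps(tmp0, tmp1);
    tmp0 = _mm_add_ps(tmp0, tmp2);
    tmp0 = _mm_add_ps(tmp0, tmp3);
    tmp0 = _mm_add_ps(tmp0, C_0n);

    _mm_storeu_ps(C_2, tmp0);
}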

I want to know if there is a less roundabout, more efficient way to take the results of the dot products and add them to the contiguous chunk of the array. As it stands, I am doing 3 additions between registers that each contain only 1 non-zero value. It seems there should be a more effective way to go about this.

I appreciate all help. Thank you.

Comments (4)

绝影如岚 2024-10-08 17:27:44

For code like this, I like to store the "transpose" of the A's and B's, so that {A_0m.x, A_1m.x, A_2m.x, A_3m.x} are stored in one vector, etc. Then you can do the dot product using just multiplies and adds, and when you're done, you have all 4 dot products in one vector without any shuffling.

This is used frequently in raytracing, to test 4 rays at once against a plane (e.g. when traversing a kd-tree). If you don't have control over the input data, though, the overhead of doing the transpose might not be worth it. The code will also run on pre-SSE4 machines, although that might not be an issue.
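
A minimal sketch of that formulation, under my own naming (AT[k] holds element k of all four A vectors, i.e. the transpose described above; the B components are broadcast with shuffles here, which you could avoid entirely if they are already available as scalars via _mm_set1_ps):

#include <xmmintrin.h>  // SSE

// AT[k] = {A_0m[k], A_1m[k], A_2m[k], A_3m[k]}: the transposed A rows.
// Returns {dot(A_0m,B), dot(A_1m,B), dot(A_2m,B), dot(A_3m,B)}.
static __m128 dot4_transposed(const __m128 AT[4], __m128 B)
{
    __m128 bx = _mm_shuffle_ps(B, B, _MM_SHUFFLE(0, 0, 0, 0)); // broadcast B[0]
    __m128 by = _mm_shuffle_ps(B, B, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast B[1]
    __m128 bz = _mm_shuffle_ps(B, B, _MM_SHUFFLE(2, 2, 2, 2)); // broadcast B[2]
    __m128 bw = _mm_shuffle_ps(B, B, _MM_SHUFFLE(3, 3, 3, 3)); // broadcast B[3]

    __m128 r = _mm_mul_ps(AT[0], bx);           // lane i accumulates A_i[k]*B[k]
    r = _mm_add_ps(r, _mm_mul_ps(AT[1], by));
    r = _mm_add_ps(r, _mm_mul_ps(AT[2], bz));
    r = _mm_add_ps(r, _mm_mul_ps(AT[3], bw));
    return r;                                   // all 4 dot products, no shuffling of results
}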


A small efficiency note on the existing code: instead of this

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
tmp0 = _mm_add_ps(tmp0, C_0n);

It may be slightly better to do this:

tmp0 = _mm_add_ps(tmp0, tmp1);  // 0 + 1 -> 0
tmp2 = _mm_add_ps(tmp2, tmp3);  // 2 + 3 -> 2
tmp0 = _mm_add_ps(tmp0, tmp2);  // 0 + 2 -> 0
tmp0 = _mm_add_ps(tmp0, C_0n);

The first two _mm_add_ps operations are now completely independent. Also, I don't know the relative timings of adds vs. shuffles, but this might be slightly faster.


Hope that helps.

她如夕阳 2024-10-08 17:27:44

It is also possible to use the SSE3 hadd instruction. In some trivial tests it turned out faster than using _mm_dp_ps.
The following returns all 4 dot products in one vector, which can then be added to the array:

#include <pmmintrin.h>  // SSE3, for _mm_hadd_ps

static inline __m128 dot_p(const __m128 x, const __m128 y[4])
{
   __m128 z[4];

   z[0] = _mm_mul_ps(x, y[0]);      // element-wise products
   z[1] = _mm_mul_ps(x, y[1]);
   z[2] = _mm_mul_ps(x, y[2]);
   z[3] = _mm_mul_ps(x, y[3]);
   z[0] = _mm_hadd_ps(z[0], z[1]);  // {z0[0]+z0[1], z0[2]+z0[3], z1[0]+z1[1], z1[2]+z1[3]}
   z[2] = _mm_hadd_ps(z[2], z[3]);  // same for z2, z3
   z[0] = _mm_hadd_ps(z[0], z[2]);  // {sum z0, sum z1, sum z2, sum z3}

   return z[0];
}
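
Applied to the question's update step, it could be used like this (a sketch using the question's variable names):

__m128 A[4] = { A_0m, A_1m, A_2m, A_3m };
__m128 dots = dot_p(B_0m, A);                // {dot(A_0m,B_0m), ..., dot(A_3m,B_0m)}
_mm_storeu_ps(C_2, _mm_add_ps(dots, C_0n));  // accumulate into the array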
探春 2024-10-08 17:27:44

You could try leaving each dot product result in the low element and using the scalar store op _mm_store_ss to save that one float from each __m128 register into the appropriate location of the array. Nehalem's store buffer should accumulate consecutive writes to the same line and flush them to L1 in batches.
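
A sketch of that variant (my own arrangement, using the question's variable names; every dot product uses the 0xF1 mask so the result lands in the low element, and _mm_add_ss folds in the previous array value):

__m128 d0 = _mm_dp_ps(A_0m, B_0m, 0xF1);  // dot in lane 0
__m128 d1 = _mm_dp_ps(A_1m, B_0m, 0xF1);
__m128 d2 = _mm_dp_ps(A_2m, B_0m, 0xF1);
__m128 d3 = _mm_dp_ps(A_3m, B_0m, 0xF1);

// Accumulate with scalar adds and stores; the four consecutive
// writes should combine in Nehalem's store buffer.
_mm_store_ss(C_2 + 0, _mm_add_ss(d0, _mm_load_ss(C_2 + 0)));
_mm_store_ss(C_2 + 1, _mm_add_ss(d1, _mm_load_ss(C_2 + 1)));
_mm_store_ss(C_2 + 2, _mm_add_ss(d2, _mm_load_ss(C_2 + 2)));
_mm_store_ss(C_2 + 3, _mm_add_ss(d3, _mm_load_ss(C_2 + 3)));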

The pro way to do it is celion's transpose approach (the first answer). MSVC's _MM_TRANSPOSE4_PS macro will do the transpose for you.
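
For reference, a sketch of the macro in use (it rewrites its four arguments in place, so copies are made here):

__m128 r0 = A_0m, r1 = A_1m, r2 = A_2m, r3 = A_3m;
_MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // r0 = {A_0m[0], A_1m[0], A_2m[0], A_3m[0]}, etc.
// r0..r3 now hold the transposed columns and can feed the
// multiply-add formulation from the first answer.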

巾帼英雄 2024-10-08 17:27:44

I realize this question is old, but why use _mm_add_ps to merge the results at all? The dp masks zero out the unused lanes, so a bitwise OR combines the four results exactly. Replace the three adds with:

tmp0 = _mm_or_ps(tmp0, tmp1);
tmp2 = _mm_or_ps(tmp2, tmp3);
tmp0 = _mm_or_ps(tmp0, tmp2);

You can probably hide some of the _mm_dp_ps latency this way: the first _mm_or_ps doesn't have to wait for the final 2 dot products, and OR is a (fast) bit-wise operation. Finally:

_mm_storeu_ps(C_2, _mm_add_ps(tmp0, C_0n));
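
Putting the whole suggestion together (a sketch using the question's variable names):

tmp0 = _mm_dp_ps(A_0m, B_0m, 0xF1);  // result -> lane 0, other lanes zeroed
tmp1 = _mm_dp_ps(A_1m, B_0m, 0xF2);  // result -> lane 1
tmp2 = _mm_dp_ps(A_2m, B_0m, 0xF4);  // result -> lane 2
tmp3 = _mm_dp_ps(A_3m, B_0m, 0xF8);  // result -> lane 3

// The unused lanes are zero, so OR merges the four results exactly;
// the first OR can issue before tmp2/tmp3 are ready.
tmp0 = _mm_or_ps(tmp0, tmp1);
tmp2 = _mm_or_ps(tmp2, tmp3);
tmp0 = _mm_or_ps(tmp0, tmp2);

_mm_storeu_ps(C_2, _mm_add_ps(tmp0, C_0n));  // one real add, against the old values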