Most efficient way to store 4 dot products into a contiguous array in C using SSE intrinsics



I am optimizing some code for an Intel x86 Nehalem micro-architecture using SSE intrinsics.

A portion of my program computes 4 dot products and adds each result to the previous values in a contiguous chunk of an array. More specifically,

tmp0 = _mm_dp_ps(A_0m, B_0m, 0xF1);  // dot product -> lane 0
tmp1 = _mm_dp_ps(A_1m, B_0m, 0xF2);  // dot product -> lane 1
tmp2 = _mm_dp_ps(A_2m, B_0m, 0xF4);  // dot product -> lane 2
tmp3 = _mm_dp_ps(A_3m, B_0m, 0xF8);  // dot product -> lane 3

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
tmp0 = _mm_add_ps(tmp0, C_0n);

_mm_storeu_ps(C_2, tmp0);

Notice that I am going about this by using 4 temporary xmm registers to hold the result of each dot product. In each xmm register, the result is placed into a unique 32-bit lane relative to the other temporary registers (the low nibble of each _mm_dp_ps mask selects which output lane receives the sum and zeroes the rest), so the end result looks like this:

tmp0= R0-zero-zero-zero

tmp1= zero-R1-zero-zero

tmp2= zero-zero-R2-zero

tmp3= zero-zero-zero-R3

I combine the values contained in each tmp variable into one xmm variable by summing them up with the following instructions:

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);

Finally, I add the register containing all 4 dot-product results to a contiguous part of the array, so that each array index is incremented by one dot product, like so (C_0n holds the 4 values currently in the array that are to be updated; C_2 is the address of those 4 values):

tmp0 = _mm_add_ps(tmp0, C_0n);
_mm_storeu_ps(C_2, tmp0);

I want to know if there is a less roundabout, more efficient way to take the results of the dot products and add them to the contiguous chunk of the array. As it stands, I am doing 3 additions between registers that each contain only 1 non-zero value. It seems there should be a more effective way to go about this.

I appreciate all help. Thank you.


Comments (4)

绝影如岚 2024-10-08 17:27:44


For code like this, I like to store the "transpose" of the A's and B's, so that {A_0m.x, A_1m.x, A_2m.x, A_3m.x} are stored in one vector, etc. Then you can do the dot product using just multiplies and adds, and when you're done, you have all 4 dot products in one vector without any shuffling.

This is used frequently in raytracing, to test 4 rays at once against a plane (e.g. when traversing a kd-tree). If you don't have control over the input data, though, the overhead of doing the transpose might not be worth it. The code will also run on pre-SSE4 machines, although that might not be an issue.
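
For illustration, here is a minimal sketch of that idea in the question's terms. The names Ax..Aw are hypothetical and assume the A rows were transposed in advance, i.e. Ax = {A_0m.x, A_1m.x, A_2m.x, A_3m.x}, Ay the .y components, and so on:

#include <xmmintrin.h>  // SSE

// Broadcast each component of B_0m across a register. If B is also
// stored pre-broadcast, these shuffles disappear entirely.
__m128 Bx = _mm_shuffle_ps(B_0m, B_0m, _MM_SHUFFLE(0, 0, 0, 0));
__m128 By = _mm_shuffle_ps(B_0m, B_0m, _MM_SHUFFLE(1, 1, 1, 1));
__m128 Bz = _mm_shuffle_ps(B_0m, B_0m, _MM_SHUFFLE(2, 2, 2, 2));
__m128 Bw = _mm_shuffle_ps(B_0m, B_0m, _MM_SHUFFLE(3, 3, 3, 3));

// Four multiplies and three adds leave all 4 dot products in one
// vector, in lane order, with no masking or merging needed.
__m128 dots = _mm_mul_ps(Ax, Bx);
dots = _mm_add_ps(dots, _mm_mul_ps(Ay, By));
dots = _mm_add_ps(dots, _mm_mul_ps(Az, Bz));
dots = _mm_add_ps(dots, _mm_mul_ps(Aw, Bw));

_mm_storeu_ps(C_2, _mm_add_ps(dots, C_0n));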


A small efficiency note on the existing code: instead of this

tmp0 = _mm_add_ps(tmp0, tmp1);
tmp0 = _mm_add_ps(tmp0, tmp2);
tmp0 = _mm_add_ps(tmp0, tmp3);
tmp0 = _mm_add_ps(tmp0, C_0n);

It may be slightly better to do this:

tmp0 = _mm_add_ps(tmp0, tmp1);  // 0 + 1 -> 0
tmp2 = _mm_add_ps(tmp2, tmp3);  // 2 + 3 -> 2
tmp0 = _mm_add_ps(tmp0, tmp2);  // 0 + 2 -> 0
tmp0 = _mm_add_ps(tmp0, C_0n);

This way, the first two _mm_add_ps calls are completely independent. Also, I don't know the relative timings of adding vs. shuffling, but this might be slightly faster.


Hope that helps.

她如夕阳 2024-10-08 17:27:44


It is also possible to use the SSE3 hadd; in some trivial tests it turned out faster than using _mm_dp_ps.
This returns 4 dot products in one vector, which can then be added to the array as in the question.

#include <pmmintrin.h>  // SSE3: _mm_hadd_ps

static inline __m128 dot_p(const __m128 x, const __m128 y[4])
{
   __m128 z[4];

   // element-wise products
   z[0] = _mm_mul_ps(x, y[0]);
   z[1] = _mm_mul_ps(x, y[1]);
   z[2] = _mm_mul_ps(x, y[2]);
   z[3] = _mm_mul_ps(x, y[3]);

   // three horizontal adds reduce the four products to
   // { dot(x,y[0]), dot(x,y[1]), dot(x,y[2]), dot(x,y[3]) }
   z[0] = _mm_hadd_ps(z[0], z[1]);
   z[2] = _mm_hadd_ps(z[2], z[3]);
   z[0] = _mm_hadd_ps(z[0], z[2]);

   return z[0];
}
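
For example, applied to the question's variables (a hypothetical usage sketch; the array just gathers the four A rows into one argument):

__m128 A[4] = { A_0m, A_1m, A_2m, A_3m };
__m128 dots = dot_p(B_0m, A);
_mm_storeu_ps(C_2, _mm_add_ps(dots, C_0n));
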
探春 2024-10-08 17:27:44


You could try leaving the dot product result in the low word and use the scalar store op _mm_store_ss to save that one float from each m128 register into the appropriate location of the array. Nehalem's store buffer should accumulate consecutive writes on the same line and flush them to L1 in batches.

The pro way to do it is celion's transpose approach. MSVC's _MM_TRANSPOSE4_PS macro will do the transpose for you.
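
A minimal sketch of doing that transpose on the fly (_MM_TRANSPOSE4_PS is provided by GCC's and Clang's xmmintrin.h as well, not just MSVC's; it shuffles the four rows in place):

#include <xmmintrin.h>

__m128 r0 = A_0m, r1 = A_1m, r2 = A_2m, r3 = A_3m;
_MM_TRANSPOSE4_PS(r0, r1, r2, r3);
// r0 now holds {A_0m.x, A_1m.x, A_2m.x, A_3m.x}, r1 the .y components,
// and so on -- ready for the multiply-add dot product shown above.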

巾帼英雄 2024-10-08 17:27:44


I realize this question is old, but why use _mm_add_ps at all to merge the registers? Replace those three adds with:

tmp0 = _mm_or_ps(tmp0, tmp1);
tmp2 = _mm_or_ps(tmp2, tmp3);
tmp0 = _mm_or_ps(tmp0, tmp2);

This works because the dp masks zero every lane except the one receiving the result, so the registers never have overlapping non-zero lanes. You can probably hide some of the _mm_dp_ps latency this way: the first _mm_or_ps doesn't wait for the final 2 dot products either, and it's a (fast) bit-wise operation. Finally:

_mm_storeu_ps(C_2, _mm_add_ps(tmp0, C_0n));
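
Put together with the question's code, the whole update might look like this (illustrative, same variables as the question):

tmp0 = _mm_dp_ps(A_0m, B_0m, 0xF1);
tmp1 = _mm_dp_ps(A_1m, B_0m, 0xF2);
tmp2 = _mm_dp_ps(A_2m, B_0m, 0xF4);
tmp3 = _mm_dp_ps(A_3m, B_0m, 0xF8);

tmp0 = _mm_or_ps(tmp0, tmp1);  // lanes are disjoint, so OR merges them
tmp2 = _mm_or_ps(tmp2, tmp3);
tmp0 = _mm_or_ps(tmp0, tmp2);

_mm_storeu_ps(C_2, _mm_add_ps(tmp0, C_0n));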