SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operations

Posted on 2024-12-13 18:10:33

I have started optimising my code using SSE. Essentially it is a ray tracer that processes 4 rays at a time by storing the coordinates in __m128 data types x, y, z (the coordinates for the four rays are grouped by axis). However, I have a branched statement that protects against divide by zero, and I can't seem to convert it to SSE. In scalar code this is:

const float d = wZ == -1.0f ? 1.0f/( 1.0f-wZ) : 1.0f/(1.0f+wZ);

Here wZ is the z-coordinate, and this calculation needs to be done for all four rays.

How could I translate this into SSE?

I have been experimenting with the SSE equality comparison as follows (wZ is now a __m128 holding the z values of the four rays):

_mm_cmpeq_ps(_mm_set1_ps(-1.0f) , wZ )

The idea was to use the result to identify the lanes where wZ[x] == -1.0, take the absolute value in those lanes, and then continue the calculation as normal.

However, I have not had much success with this approach.
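
For readers unsure what the comparison above returns, here is a minimal sketch. _mm_cmpeq_ps does not produce a boolean: it produces a per-lane bit mask, all ones where the lanes compare equal and all zeros elsewhere. The helper name lanes_equal_minus_one is hypothetical and is not part of the question's code; _mm_movemask_ps is used only to make the mask easy to inspect.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_cmpeq_ps, _mm_movemask_ps */

/* Hypothetical helper: returns a 4-bit value in which bit i is set
   when lane i of wZ equals -1.0f. */
static int lanes_equal_minus_one(__m128 wZ)
{
    /* Each lane of eq is all ones where wZ == -1.0f, all zeros elsewhere. */
    const __m128 eq = _mm_cmpeq_ps(wZ, _mm_set1_ps(-1.0f));
    /* Gather the top bit of each lane into the low four bits of an int. */
    return _mm_movemask_ps(eq);
}
```

Note that _mm_set_ps takes its arguments from the highest lane down to the lowest, so in _mm_set_ps(-1.0f, 0.0f, 1.0f, 2.0f) the value -1.0f sits in lane 3. In the actual computation the mask would normally stay in a register and feed a select, rather than being moved to an int.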

嘴硬脾气大 2024-12-20 18:10:33

Here's a fairly straightforward solution which just implements the scalar code with SSE without any further optimisation. It can probably be made a little more efficient, e.g. by exploiting the fact that the result will be 0.5 when wZ = -1.0, or perhaps even by just doing the division regardless and then converting the INFs to 0.5 after the fact.

I've used #ifdef to separate the SSE4.1 path from the pre-SSE4 path, since SSE4.1 has a "blend" instruction (_mm_blendv_ps) which may be a little more efficient than the three pre-SSE4 instructions (and-not, and, or) that are otherwise needed to mask and select values.

#include <emmintrin.h>
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif

#include <stdio.h>

int main(void)
{
    const __m128 vk1 = _mm_set1_ps(1.0f);       // useful constants
    const __m128 vk0 = _mm_set1_ps(0.0f);

    __m128 wZ, d, d0, d1, vcmp;
#ifndef __SSE4_1__  // pre-SSE4 implementation
    __m128 d0_masked, d1_masked;
#endif

    wZ = _mm_set_ps(-1.0f, 0.0f, 1.0f, 2.0f);   // test inputs

    d0 = _mm_sub_ps(vk1, wZ);                   // d0 = 1.0 - wZ
    d1 = _mm_add_ps(vk1, wZ);                   // d1 = 1.0 + wZ
    vcmp = _mm_cmpneq_ps(d1, vk0);              // test for d1 != 0.0, i.e. wZ != -1.0
#ifdef __SSE4_1__   // SSE4 implementation
    d = _mm_blendv_ps(d0, d1, vcmp);
#else               // pre-SSE4 implementation
    d0_masked = _mm_andnot_ps(vcmp, d0);
    d1_masked = _mm_and_ps(vcmp, d1);
    d = _mm_or_ps(d0_masked, d1_masked);        // d = wZ == -1.0 ? 1.0 - wZ : 1.0 + wZ
#endif
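    d = _mm_div_ps(vk1, d);                     // d = 1.0 / d

    // %vf is a non-standard printf extension, so store the lanes and print them as floats
    float in[4], out[4];
    _mm_storeu_ps(in, wZ);
    _mm_storeu_ps(out, d);
    printf("wZ = %g %g %g %g\n", in[0], in[1], in[2], in[3]);
    printf("d = %g %g %g %g\n", out[0], out[1], out[2], out[3]);

    return 0;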
}
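
The optimisation mentioned at the start of this answer (the result is always 0.5 when wZ = -1.0) could be sketched as follows. This is only an illustration under stated assumptions, not a tuned implementation: the helper name recip_one_plus_wz is hypothetical, it needs nothing beyond SSE, and it assumes floating-point exceptions are masked (the usual default), so the division by zero in the affected lanes quietly yields infinity before being replaced.

```c
#include <xmmintrin.h>  /* SSE: __m128 and the _mm_*_ps intrinsics used below */

/* Hypothetical helper: per lane, d = (wZ == -1.0) ? 0.5 : 1.0 / (1.0 + wZ). */
static __m128 recip_one_plus_wz(__m128 wZ)
{
    const __m128 one  = _mm_set1_ps(1.0f);
    const __m128 half = _mm_set1_ps(0.5f);

    /* All ones in the lanes where wZ == -1.0, all zeros elsewhere. */
    const __m128 is_m1 = _mm_cmpeq_ps(wZ, _mm_set1_ps(-1.0f));

    /* Divide unconditionally; lanes with wZ == -1.0 divide by zero
       and are overwritten in the select below. */
    const __m128 r = _mm_div_ps(one, _mm_add_ps(one, wZ));

    /* Select: 0.5 where the mask is set, the quotient elsewhere. */
    return _mm_or_ps(_mm_and_ps(is_m1, half), _mm_andnot_ps(is_m1, r));
}
```

Compared with the listing above, this drops the subtraction and selects on the final result instead of on the divisor. Whether it is actually faster depends on the target CPU and the surrounding code, so it is worth measuring.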
