SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation
I have started optimising my code using SSE. Essentially it is a ray tracer that processes 4 rays at a time by storing the coordinates in __m128 data types x, y, z (the coordinates for the four rays are grouped by axis). However, I have a branch that protects against divide-by-zero, and I can't seem to convert it to SSE. In serial code this is:
const float d = wZ == -1.0f ? 1.0f/( 1.0f-wZ) : 1.0f/(1.0f+wZ);
Where wZ is the z-coordinate and this calculation needs to be done for all four rays.
How could I translate this into SSE?
I have been experimenting with the SSE equality comparison as follows (wZ is now a __m128 containing the z values for each of the four rays):
_mm_cmpeq_ps(_mm_set1_ps(-1.0f) , wZ )
The idea was to use the result to identify the lanes where wZ[x] == -1.0, take the absolute value in those lanes, and then continue the calculation as normal.
However, I have not had much success with this.
Here's a fairly straightforward solution which just implements the scalar code with SSE without any further optimisation. It can probably be made a little more efficient, e.g. by exploiting the fact that the result will be 0.5 when wZ = -1.0, or perhaps even by just doing the division regardless and then converting the INFs to 0.5 after the fact. I've #ifdef'd for SSE4 versus pre-SSE4, since SSE4 has a "blend" instruction which may be a little more efficient than the three pre-SSE4 instructions that are otherwise needed to mask and select values.
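The listing for this answer did not survive on this page, so the following is only a minimal sketch of the approach it describes, not the original code. The helper name inv_one_plus_wz is hypothetical, and using the compiler-defined __SSE4_1__ macro to pick the blend path is an assumption. Both quotients are computed for every lane and the comparison mask selects between them, so the infinity produced by 1/(1+wZ) in a lane where wZ == -1.0 is simply discarded.

```c
#include <xmmintrin.h>   /* SSE: __m128, _mm_cmpeq_ps, _mm_div_ps, ... */
#ifdef __SSE4_1__
#include <smmintrin.h>   /* SSE4.1: _mm_blendv_ps */
#endif

/* Hypothetical helper (name is an assumption). Per lane it computes:
 *   d = (wZ == -1.0f) ? 1.0f / (1.0f - wZ) : 1.0f / (1.0f + wZ)
 */
static inline __m128 inv_one_plus_wz(__m128 wZ)
{
    const __m128 one  = _mm_set1_ps(1.0f);
    /* All bits set in lanes where wZ == -1.0f, all bits clear elsewhere. */
    const __m128 mask = _mm_cmpeq_ps(wZ, _mm_set1_ps(-1.0f));
    /* Both branches are evaluated for all four lanes. */
    const __m128 d_eq = _mm_div_ps(one, _mm_sub_ps(one, wZ)); /* 1/(1-wZ) */
    const __m128 d_ne = _mm_div_ps(one, _mm_add_ps(one, wZ)); /* 1/(1+wZ) */
#ifdef __SSE4_1__
    /* blendv takes the second operand where the mask's sign bit is set. */
    return _mm_blendv_ps(d_ne, d_eq, mask);
#else
    /* Pre-SSE4 select: (mask & d_eq) | (~mask & d_ne). */
    return _mm_or_ps(_mm_and_ps(mask, d_eq), _mm_andnot_ps(mask, d_ne));
#endif
}
```

For example, lanes holding -1.0, 0.0, 1.0 and 3.0 give 0.5, 1.0, 0.5 and 0.25. Since the masked lanes always evaluate to 0.5, d_eq could be replaced by _mm_set1_ps(0.5f) to save one division, as suggested above.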