Porting SSE2 to Arm NEON intrinsics

Posted 2024-12-06 10:05:45


I have the following code written with SSE2 intrinsics. It processes input from a Kinect.

__m128i md = _mm_setr_epi16(
    (r0<<3)  | (r1>>5),
    (r1<<6)  | (r2>>2),
    (r2<<9)  | (r3<<1) | (r4>>7),
    (r4<<4)  | (r5>>4),
    (r5<<7)  | (r6>>1),
    (r6<<10) | (r7<<2) | (r8>>6),
    (r8<<5)  | (r9>>3),
    (r9<<8)  | (r10));
md = _mm_and_si128(md, mmask);
__m128i mz = _mm_load_si128((__m128i *) &depth_ref_z[i]);
__m128i mZ = _mm_load_si128((__m128i *) &depth_ref_Z[i]);
mz = _mm_cmpgt_epi16(md, mz);
mZ = _mm_cmpgt_epi16(mZ, md);
mz = _mm_and_si128(mz, mZ);
md = _mm_and_si128(mz, md);
_mm_store_si128((__m128i *) frame, md);
if(_mm_movemask_epi8(mz)){ ... }

This unpacks 11 uint8_t values (r0-r10) into 8 uint16_t lanes of an SSE register (mmask is a constant created earlier). It then loads two more registers with the corresponding elements of two arrays that serve as lower and upper bounds, compares md against both, and produces a register in which every element that fails the bounds check is zeroed. The result is stored and each element is then processed further. The movemask is a nice optimization: when none of the elements pass, the processing can be skipped entirely.
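For reference when checking a port, the steps above can be written as plain scalar C. This is only a sketch: the helper names `unpack11` and `filter8` are made up for illustration, and the value 0x07FF for mmask is an assumption (the question only says it is a constant), chosen because each output holds 11 bits.

```c
#include <stdint.h>

/* Assumption: mmask keeps the low 11 bits of every 16-bit lane. */
#define DEPTH_MASK 0x07FFu

/* Hypothetical helper: unpack 11 packed bytes into 8 values,
   using the same shift/or expressions as the SSE2 code. */
static void unpack11(const uint8_t r[11], uint16_t out[8])
{
    out[0] = (uint16_t)(((r[0] << 3)  | (r[1] >> 5)) & DEPTH_MASK);
    out[1] = (uint16_t)(((r[1] << 6)  | (r[2] >> 2)) & DEPTH_MASK);
    out[2] = (uint16_t)(((r[2] << 9)  | (r[3] << 1) | (r[4] >> 7)) & DEPTH_MASK);
    out[3] = (uint16_t)(((r[4] << 4)  | (r[5] >> 4)) & DEPTH_MASK);
    out[4] = (uint16_t)(((r[5] << 7)  | (r[6] >> 1)) & DEPTH_MASK);
    out[5] = (uint16_t)(((r[6] << 10) | (r[7] << 2) | (r[8] >> 6)) & DEPTH_MASK);
    out[6] = (uint16_t)(((r[8] << 5)  | (r[9] >> 3)) & DEPTH_MASK);
    out[7] = (uint16_t)(((r[9] << 8)  | r[10]) & DEPTH_MASK);
}

/* Hypothetical helper: keep d[k] only where lo[k] < d[k] < hi[k]
   (signed 16-bit compares, like _mm_cmpgt_epi16), zero it otherwise.
   Returns nonzero if at least one element passed. */
static int filter8(const uint16_t d[8], const int16_t lo[8],
                   const int16_t hi[8], uint16_t frame[8])
{
    int any = 0;
    for (int k = 0; k < 8; ++k) {
        int pass = ((int16_t)d[k] > lo[k]) && (hi[k] > (int16_t)d[k]);
        frame[k] = pass ? d[k] : 0;
        any |= pass;
    }
    return any;
}
```

The return value of `filter8` plays the role of the `if(_mm_movemask_epi8(mz))` test: it only says whether anything passed, not which lanes did.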

This works nicely, and now I want to port it to NEON as well. Most of it is straightforward except for two parts. Looking at the assembler output (gcc) for the SSE2 code, I see that instead of doing 8 uint16_t moves for _mm_setr_epi16, it shifts and ORs pairs of them into uint32_t values and then does 4 moves. That seems efficient, and since the compiler takes care of it I didn't change the code. Should I apply that manually in the NEON case, that is, do the shifting myself and perform 4 vsetq_lane_u32 instead of 8 vsetq_lane_u16? Will I run into any issues with endianness, and will it be worthwhile?
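To make the endianness concern concrete, here is a small portable sketch of the pairing idea. It only models memory layout: it merges neighbouring 16-bit values into 32-bit words and checks whether the bytes come out the same as the plain 16-bit array. The helper names are hypothetical, and this says nothing about how a particular compiler numbers NEON lanes on a big-endian target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: merge two 16-bit values into one 32-bit word,
   the even-indexed value in the low half, the odd-indexed one in the high half. */
static uint32_t pack_pair(uint16_t even, uint16_t odd)
{
    return (uint32_t)even | ((uint32_t)odd << 16);
}

/* Returns 1 if the four packed words have the same byte layout in memory
   as the eight original 16-bit values. */
static int packed_layout_matches(const uint16_t v[8])
{
    uint32_t w[4];
    for (int k = 0; k < 4; ++k)
        w[k] = pack_pair(v[2 * k], v[2 * k + 1]);
    return memcmp(w, v, sizeof w) == 0;
}

/* Returns 1 on a little-endian host. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

With distinct values in `v`, the layouts match on a little-endian host and differ on a big-endian one, where the two halves of each pair would need to be swapped in `pack_pair`.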

The final part is the movemask, for which I haven't been able to find a NEON equivalent. Can anyone suggest one?
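Since the code above only tests the movemask result against zero, one possible direction is an "is any lane set" check rather than a full per-byte bitmask. The sketch below is an assumption about what would be sufficient here, not a drop-in replacement for _mm_movemask_epi8; `any_lane_set` is a hypothetical name, and the scalar branch exists only so the same file builds on non-Arm hosts.

```c
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: nonzero if any of the 8 mask lanes has a bit set. */
static int any_lane_set(const uint16_t mask[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t m = vld1q_u16(mask);
    /* OR the two 64-bit halves together, then test the result as one 64-bit value. */
    uint16x4_t folded = vorr_u16(vget_low_u16(m), vget_high_u16(m));
    return vget_lane_u64(vreinterpret_u64_u16(folded), 0) != 0;
#else
    /* Scalar fallback for hosts without NEON. */
    uint16_t acc = 0;
    for (int k = 0; k < 8; ++k)
        acc |= mask[k];
    return acc != 0;
#endif
}
```

In a real port the mask would already be in a register, so the `vld1q_u16` load would be dropped and the comparison result passed in directly.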
