Converting between SSE and NEON intrinsics: shuffling
I am trying to convert code written with SSE3 intrinsics to NEON SIMD and am stuck on a shuffle function. I have looked at the GCC intrinsics, the ARM manuals and other forums, but have not been able to find a solution.
CODE:
__m128i upper = _mm_loadu_si128((__m128i*)p1);
register __m128i mask1 = _mm_set_epi8 (0x80,0x80,0x80,0x80,0x80,0x80,0x80,12,0x80,10,0x80,7,0x80,4,0x80,1);
register __m128i mask2 = _mm_set_epi8 (0x80,0x80,0x80,0x80,0x80,0x80,12,0x80,10,0x80,7,0x80,4,0x80,1,0x80);
__m128i temp1_upper = _mm_or_si128(_mm_shuffle_epi8(upper,mask1),_mm_shuffle_epi8(upper,mask2));
Though the vtbl1_u8(uint8x8_t, uint8x8_t) intrinsic performs a table lookup that can be used to assign values to a destination register, it only operates on 64-bit registers. Also, the shuffle operation performs a comparison at the start, which would have to be done in NEON as well, and I do not know how to do that efficiently.
r0 = (mask0 & 0x80) ? 0 : SELECT(a, mask0 & 0x0f) // SELECT(a,n) extracts nth 8-bit parameter from a.
r1 = (mask1 & 0x80) ? 0 : SELECT(a, mask1 & 0x0f)
...
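As a reference point, the per-byte rule above can be written as a scalar C model. This is only a sketch: the name pshufb_ref is made up for illustration, and the arrays are assumed to be in memory order (element 0 is the lowest byte of the register). Note that _mm_shuffle_epi8 itself belongs to SSSE3.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the byte shuffle done by _mm_shuffle_epi8.
   pshufb_ref is a hypothetical name used only for this sketch.
   out must not alias a. */
static void pshufb_ref(const uint8_t a[16], const uint8_t mask[16],
                       uint8_t out[16])
{
    assert(a && mask && out);
    for (int i = 0; i < 16; ++i) {
        /* Bit 7 set: the result byte is zero.
           Otherwise the low 4 bits select a source byte;
           bits 4..6 of the mask byte are ignored. */
        out[i] = (mask[i] & 0x80) ? 0 : a[mask[i] & 0x0F];
    }
}
```

A model like this is handy for checking any NEON version against known inputs before worrying about speed.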
I cannot find an instruction that first checks the high bit of the mask and then selects using the lower 4 bits of the mask. I know that we can compare each byte in the register and then use the lower 4 bits only if the condition holds, but I was hoping to do it efficiently. I hope someone can help or provide a reference.
Thanks a lot,
Cheers!
Comments (2)
VTBL returns 0 when the index is out of range.
Since it supports up to two Q registers (four D registers) as the lookup table, it is quite simple: a plain VTBL with the 16-byte source as the table will do the trick.
If you want bits 4~6 of the index to stay out of the way, as they do on SSE, you can mask them out prior to the vtbl.
Unfortunately, the immediate form of VBIC is of no use here, because it only works on 16-bit and 32-bit elements, not on 8-bit ones.
Therefore, you have to sacrifice a register initialized with the bit mask operand.
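The masking step could look roughly like the sketch below. The helper name clear_index_bits_4_to_6 and the array-based interface are assumptions made for illustration; the constant 0x8F keeps bit 7 and the low 4 bits, which are the only bits the SSE shuffle looks at. A scalar fallback is included so the snippet also builds on non-ARM machines.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: clear bits 4..6 of every index byte so that a
   following VTBL sees only bit 7 and the low 4 bits of each index. */
static void clear_index_bits_4_to_6(const uint8_t idx[16], uint8_t out[16])
{
    assert(idx && out);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* This is the register "sacrificed" to hold the bit mask. */
    const uint8x16_t keep = vdupq_n_u8(0x8F);
    vst1q_u8(out, vandq_u8(vld1q_u8(idx), keep));
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 16; ++i)
        out[i] = (uint8_t)(idx[i] & 0x8F);
#endif
}
```

After this step an index such as 0x13 becomes 0x03 and stays in range, while anything with bit 7 set remains out of range and still yields 0 from the lookup. If the masks are compile-time constants that already use only 0x80 and values 0 to 15, as in the question, this step can be skipped.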
You just need to use vtbl2_u8 twice, splitting the input and joining the output appropriately. As Jake said, vtbl returns 0 whenever the index is out of range, so you shouldn't need any special handling for the 0x80 case.
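A minimal sketch of that idea follows. The names shuffle_u8x16 and shuffle_bytes are made up for illustration, and the scalar branch exists only so the snippet builds and can be checked on machines without NEON; it follows the VTBL rule (index out of range gives 0), not the SSE rule.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: 128-bit byte shuffle from two 64-bit lookups. */
static inline uint8x16_t shuffle_u8x16(uint8x16_t src, uint8x16_t idx)
{
    uint8x8x2_t table;
    table.val[0] = vget_low_u8(src);    /* split the input into two D halves */
    table.val[1] = vget_high_u8(src);
    uint8x8_t lo = vtbl2_u8(table, vget_low_u8(idx));
    uint8x8_t hi = vtbl2_u8(table, vget_high_u8(idx));
    return vcombine_u8(lo, hi);         /* join the two result halves */
}
#endif

/* Array wrapper (also hypothetical); out must not alias src. */
static void shuffle_bytes(const uint8_t src[16], const uint8_t idx[16],
                          uint8_t out[16])
{
    assert(src && idx && out);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_u8(out, shuffle_u8x16(vld1q_u8(src), vld1q_u8(idx)));
#else
    for (int i = 0; i < 16; ++i)        /* same rule as VTBL with a 16-byte table */
        out[i] = (idx[i] < 16) ? src[idx[i]] : 0;
#endif
}
```

For the code in the question, the two masks never select a byte in the same lane, so the OR of the two shuffles equals one shuffle with a merged index. Keeping in mind that _mm_set_epi8 lists bytes from highest to lowest, the merged index in memory order is {1,1,4,4,7,7,10,10,12,12,0x80,0x80,0x80,0x80,0x80,0x80}. On AArch64 the vqtbl1q_u8 intrinsic takes a 16-byte table directly, so the split and join should not be needed there.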