Conversion between SSE and NEON intrinsics: shuffling

Posted 2024-12-13 09:34:08

I am trying to convert code written with SSSE3 intrinsics to NEON SIMD and am stuck on a shuffle function. I have looked at the GCC intrinsics, the ARM manuals and other forums, but have not been able to find a solution.

CODE:

__m128i upper = _mm_loadu_si128((__m128i*)p1);

register __m128i mask1 = _mm_set_epi8 (0x80,0x80,0x80,0x80,0x80,0x80,0x80,12,0x80,10,0x80,7,0x80,4,0x80,1);
register __m128i mask2 = _mm_set_epi8 (0x80,0x80,0x80,0x80,0x80,0x80,12,0x80,10,0x80,7,0x80,4,0x80,1,0x80);
__m128i temp1_upper = _mm_or_si128(_mm_shuffle_epi8(upper,mask1),_mm_shuffle_epi8(upper,mask2));

Although vtbl1_u8(uint8x8_t, uint8x8_t) performs a table lookup that can be used to assign values to a destination register, it only operates on 64-bit registers. Also, the shuffle operation starts with a comparison, which would have to be reproduced in NEON, and I do not know how to do that efficiently.

r0 = (mask0 & 0x80) ? 0 : SELECT(a, mask0 & 0x0f) // SELECT(a,n) extracts nth 8-bit parameter from a.
r1 = (mask1 & 0x80) ? 0 : SELECT(a, mask1 & 0x0f)
...

I cannot find an instruction that first checks the high bit of each mask byte and then selects a source byte using the low 4 bits of that mask byte. I know we could compare each byte in the register and then use the low 4 bits when the condition holds, but I was hoping for something more efficient. I hope someone can help or point me to a reference.

Thanks a lot,

Cheers!

枫以 2024-12-20 09:34:08

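VTBL returns 0 when the index is out of range.

Since it accepts up to four D registers (the equivalent of two Q registers) as the lookup table, this is pretty straightforward:

load the lookup table into a Q register (Q8, for example)
vtbl.8 d0, {d16, d17}, d0 (where d0 contains your masks; d16 and d17 are the two halves of q8)

will do the trick.

If you do not want bits 4~6 of the mask bytes to affect the result, you can clear them before the vtbl.

Unfortunately, VBIC with an immediate is of no use for 8-bit elements.

So you have to sacrifice a register that holds the bitmask operand:

vmov.i8 d1, #0x70
load the lookup table into a Q register (Q8, for example)
vbic d0, d0, d1
vtbl.8 d0, {d16, d17}, d0 (where d0 contains your masks)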
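
To see why clearing bits 4~6 can matter, here is a minimal scalar sketch in plain C (no intrinsics) that models the two lookup rules side by side. The names model_pshufb, model_vtbl16, models_agree and check_question_mask are made up for illustration. The first model follows the pseudocode in the question, the second follows the out-of-range rule stated above for a 16-byte table; neither is taken from a vendor header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of _mm_shuffle_epi8: a set high bit in the index byte
   yields 0, otherwise the low 4 bits select a source byte. */
static void model_pshufb(const uint8_t *src, const uint8_t *idx, uint8_t *out)
{
    for (int i = 0; i < 16; ++i)
        out[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0f];
}

/* Scalar model of VTBL over a 16-byte table (two D registers):
   any index outside 0..15 yields 0. */
static void model_vtbl16(const uint8_t *src, const uint8_t *idx, uint8_t *out)
{
    for (int i = 0; i < 16; ++i)
        out[i] = (idx[i] < 16) ? src[idx[i]] : 0;
}

/* Returns 1 if both models produce the same 16 bytes for these indices. */
static int models_agree(const uint8_t *src, const uint8_t *idx)
{
    uint8_t a[16], b[16];
    model_pshufb(src, idx, a);
    model_vtbl16(src, idx, b);
    return memcmp(a, b, 16) == 0;
}

/* Self-check using mask1 from the question, written in memory order
   (byte 0 first; _mm_set_epi8 lists its arguments from byte 15 down). */
static void check_question_mask(void)
{
    uint8_t src[16];
    for (int i = 0; i < 16; ++i)
        src[i] = (uint8_t)(i * 0x11);      /* 0x00, 0x11, ..., 0xff */

    const uint8_t mask1[16] = { 1, 0x80, 4, 0x80, 7, 0x80, 10, 0x80,
                                12, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 };
    assert(models_agree(src, mask1));      /* only 0x80 or indices < 16 */

    uint8_t odd[16] = { 0x13 };            /* bits 4~6 set, high bit clear */
    assert(!models_agree(src, odd));       /* PSHUFB picks byte 3, VTBL gives 0 */

    odd[0] &= (uint8_t)~0x70;              /* what the VBIC step does */
    assert(models_agree(src, odd));
}
```

For masks like the ones in the question, where every byte is either 0x80 or an index below 16, the two rules agree and the extra VBIC step is not needed.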


江心雾 2024-12-20 09:34:08

You just need to use vtbl2_u8 twice, splitting the input and joining the output appropriately:

#define uint8x16_to_8x8x2(v) ((uint8x8x2_t) { vget_low_u8(v), vget_high_u8(v) })

uint8x16_t a = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff };
uint8x16_t b = { 0x80, 0x0f, 0x01, 0x0e, 0x02, 0x0d, 0x03, 0x0c, 0x04, 0x0b, 0x05, 0x0a, 0x06, 0x09, 0x07, 0x08 };
uint8x16_t c = vcombine_u8(vtbl2_u8(uint8x16_to_8x8x2(a), vget_low_u8(b)), vtbl2_u8(uint8x16_to_8x8x2(a), vget_high_u8(b)));
// c = 00 ff 11 ee 22 dd 33 cc 44 bb 55 aa 66 99 77 88

As Jake said, vtbl returns 0 whenever the index is out of range, so you shouldn't need any special handling for the 0x80 case.
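
Applied to the masks from the question, the same idea might look like the sketch below. It is untested on real hardware and not benchmarked, so treat it as a starting point only. The names HAVE_NEON, shuffle16, shuffle_two_masks, shuffle_merged and self_check are made up for illustration. The masks are written in memory order, which is the reverse of the _mm_set_epi8 argument order. On AArch64 the sketch uses vqtbl1q_u8, which looks up all 16 bytes at once; on 32-bit ARM it falls back to two vtbl2_u8 calls as above; elsewhere it uses a plain loop so the file still compiles.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* The question's masks in memory order (byte 0 first). */
static const uint8_t mask1[16] = { 1, 0x80, 4, 0x80, 7, 0x80, 10, 0x80,
                                   12, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 };
static const uint8_t mask2[16] = { 0x80, 1, 0x80, 4, 0x80, 7, 0x80, 10,
                                   0x80, 12, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 };
/* mask1 and mask2 fill disjoint output bytes, so OR-ing the two shuffles
   gives the same bytes as one shuffle with this merged mask. */
static const uint8_t merged[16] = { 1, 1, 4, 4, 7, 7, 10, 10,
                                    12, 12, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 };

/* out[i] = src[idx[i]] if idx[i] < 16, else 0 */
static void shuffle16(const uint8_t *src, const uint8_t *idx, uint8_t *out)
{
#if HAVE_NEON
    uint8x16_t v = vld1q_u8(src);
    uint8x16_t m = vld1q_u8(idx);
#if defined(__aarch64__)
    vst1q_u8(out, vqtbl1q_u8(v, m));       /* one 16-byte lookup on AArch64 */
#else
    uint8x8x2_t table;                     /* 16-byte table as two D halves */
    table.val[0] = vget_low_u8(v);
    table.val[1] = vget_high_u8(v);
    vst1q_u8(out, vcombine_u8(vtbl2_u8(table, vget_low_u8(m)),
                              vtbl2_u8(table, vget_high_u8(m))));
#endif
#else
    for (int i = 0; i < 16; ++i)           /* portable fallback */
        out[i] = (idx[i] < 16) ? src[idx[i]] : 0;
#endif
}

/* Direct port of the question: two lookups OR-ed together. */
static void shuffle_two_masks(const uint8_t *src, uint8_t *out)
{
    uint8_t a[16], b[16];
    shuffle16(src, mask1, a);
    shuffle16(src, mask2, b);
    for (int i = 0; i < 16; ++i)
        out[i] = a[i] | b[i];
}

/* Same bytes with a single lookup. */
static void shuffle_merged(const uint8_t *src, uint8_t *out)
{
    shuffle16(src, merged, out);
}

/* Self-check on sample data: both routes must give the same bytes. */
static void self_check(void)
{
    uint8_t src[16], x[16], y[16];
    for (int i = 0; i < 16; ++i)
        src[i] = (uint8_t)(i * 0x11);      /* 0x00, 0x11, ..., 0xff */
    shuffle_two_masks(src, x);
    shuffle_merged(src, y);
    for (int i = 0; i < 16; ++i)
        assert(x[i] == y[i]);
}
```

Merging the masks halves the number of lookups, but it only works because the two original masks never select into the same output byte.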
