Count the number of unique values in a 128-bit AVX vector, or detect whether all elements are equal?
I'm optimizing a hot path in my codebase and I have turned to vectorization. Keep in mind, I'm still quite new to all of this SIMD stuff. Here is the problem I'm trying to solve, implemented without SIMD:
inline int count_unique(int c1, int c2, int c3, int c4)
{
    return 4 - (c2 == c1)
             - ((c3 == c1) || (c3 == c2))
             - ((c4 == c1) || (c4 == c2) || (c4 == c3));
}
The assembly output after compiling with -O3:
count_unique:
xor eax, eax
cmp esi, edi
mov r8d, edx
setne al
add eax, 3
cmp edi, edx
sete dl
cmp esi, r8d
sete r9b
or edx, r9d
movzx edx, dl
sub eax, edx
cmp edi, ecx
sete dl
cmp r8d, ecx
sete dil
or edx, edi
cmp esi, ecx
sete cl
or edx, ecx
movzx edx, dl
sub eax, edx
ret
How would something like this be done when c1, c2, c3, c4 are stored as a 16-byte integer vector?
Comments (2)
For your simplified problem (testing all 4 lanes for equality), I would do it slightly differently. This way the complete test takes only 3 instructions.
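A minimal sketch of one such three-instruction test, offered as an assumption rather than the answer's exact code: rotate the vector by one lane, XOR it with the original, and check that the result is all zero. The function name `all_lanes_equal` is hypothetical, and the PTEST step (`_mm_testz_si128`) assumes SSE4.1 is available.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128 (also pulls in SSE2) */

/* Hypothetical helper: returns 1 when all four 32-bit lanes of v are equal.
   The target attribute is GCC/Clang specific; it lets this compile without
   passing -msse4.1 on the command line. */
__attribute__((target("sse4.1")))
static inline int all_lanes_equal(__m128i v)
{
    /* Rotate by one lane: [c2, c3, c4, c1] (pshufd) */
    __m128i rotated = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 3, 2, 1));
    /* Every lane is zero only if each element equals its neighbour (pxor) */
    __m128i diff = _mm_xor_si128(v, rotated);
    /* ptest: returns 1 when (diff & diff) has no bits set */
    return _mm_testz_si128(diff, diff);
}
```

Because the comparison wraps around, c1 == c2, c2 == c3, c3 == c4 and c4 == c1 are all checked at once, which is enough to conclude that all four values are equal.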
OK, I have "simplified" the problem: the only case where I was using the unique count was to check whether it was 1. That is the same as checking whether all elements are equal, which can be done by comparing the input with itself, shifted over by one element (4 bytes) using the _mm_alignr_epi8 function.
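The idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `all_equal_alignr` is hypothetical, SSSE3 is assumed for `_mm_alignr_epi8`, and the final reduction with `_mm_movemask_epi8` is one possible way to turn the comparison result into a boolean.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (also pulls in SSE2) */

/* Hypothetical helper: returns 1 when all four 32-bit lanes of v are equal.
   The target attribute is GCC/Clang specific; it lets this compile without
   passing -mssse3 on the command line. */
__attribute__((target("ssse3")))
static inline int all_equal_alignr(__m128i v)
{
    /* Passing v as both operands makes the 4-byte shift a rotation:
       lanes become [c2, c3, c4, c1] */
    __m128i rotated = _mm_alignr_epi8(v, v, 4);
    /* Lane-wise 32-bit compare: all-ones where v[i] == rotated[i] */
    __m128i eq = _mm_cmpeq_epi32(v, rotated);
    /* All 16 byte-mask bits set means every lane matched its neighbour */
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

Using the same register for both operands matters here: a plain byte shift would fill the vacated lane with zeros and could report a false match when an element happens to be 0.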