Is there an ARM64 equivalent to the x86-64 SSE2 integer SIMD GCC built-in functions?
I'm trying to use an AMM algorithm (approximate matrix multiplication) on Apple's M1. It is built entirely around speed and uses the x86 built-in functions listed below. Since running it in an x86 VM slows down several crucial steps of the algorithm, I was wondering whether there is another way to run it on ARM64.
I also could not find suitable documentation for the ARM64 built-in functions, which might have helped with mapping some of the x86-64 instructions.
Built-in functions used:
__builtin_ia32_vec_init_v2si
__builtin_ia32_vec_ext_v2si
__builtin_ia32_packsswb
__builtin_ia32_packssdw
__builtin_ia32_packuswb
__builtin_ia32_punpckhbw
__builtin_ia32_punpckhwd
__builtin_ia32_punpckhdq
__builtin_ia32_punpcklbw
__builtin_ia32_punpcklwd
__builtin_ia32_punpckldq
__builtin_ia32_paddb
__builtin_ia32_paddw
__builtin_ia32_paddd
Comments (1)
Normally you'd use intrinsics instead of the raw GCC builtin functions, but see https://gcc.gnu.org/onlinedocs/gcc/ARM-C-Language-Extensions-_0028ACLE_0029.html. The __builtin_arm_... and __builtin_aarch64_... functions like __builtin_aarch64_saddl2v16qi don't seem to be documented in the GCC manual the way the x86 ones are, which is just another sign they're not intended for direct use.

See also https://developer.arm.com/documentation/102467/0100/Why-Neon-Intrinsics- re: intrinsics and #include <arm_neon.h>. GCC provides a version of that header, with the documented intrinsics API implemented using __builtin_aarch64_... GCC builtins.

As far as portability libraries go, AFAIK nothing translates from the raw builtins, but SIMDe (https://github.com/simd-everywhere/simde) has portable implementations of immintrin.h Intel intrinsics like _mm_packs_epi16. Most code should be using that API instead of GNU C builtins, unless you're using GNU C native vectors (__attribute__((vector_size(16)))) for portable SIMD without any ISA-specific stuff. But that's not viable when you want to take advantage of special shuffles and the like.

And yes, ARM does have narrowing with saturation, with instructions like vqmovn (https://developer.arm.com/documentation/dui0473/m/neon-instructions/vqmovn-and-vqmovun), so SIMDe can efficiently emulate pack instructions. That link is for AArch32, not AArch64, but hopefully there's an equivalent AArch64 instruction.
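
To make the saturating-narrow point concrete, here is a minimal sketch of how a packsswb-style pack and a punpcklbw-style interleave could be written with arm_neon.h intrinsics. It assumes the builtins in the question are the 64-bit (MMX-width) forms, which the v2si suffix suggests. The helper names (pack_s16x4_to_s8x8, interleave_lo_s8x8, sat_s8) are made up for illustration. vqmovn_s16, vcombine_s16 and vzip_s8 are real arm_neon.h intrinsics, and as far as I know AArch64 compilers lower vqmovn_s16 to the SQXTN instruction. The scalar fallback is only there so the sketch also builds on machines without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#else
/* Scalar fallback helper (hypothetical): clamp a 16-bit value to int8_t. */
static int8_t sat_s8(int16_t v) {
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}
#endif

/* packsswb-style pack (hypothetical name): out[0..3] come from a,
 * out[4..7] come from b, each narrowed to int8_t with signed saturation. */
void pack_s16x4_to_s8x8(const int16_t a[4], const int16_t b[4], int8_t out[8]) {
#if defined(__ARM_NEON)
    /* Join the two 64-bit halves, then saturating-narrow 8 x s16 to 8 x s8. */
    int16x8_t wide = vcombine_s16(vld1_s16(a), vld1_s16(b));
    vst1_s8(out, vqmovn_s16(wide));
#else
    for (int i = 0; i < 4; i++) {
        out[i] = sat_s8(a[i]);
        out[i + 4] = sat_s8(b[i]);
    }
#endif
}

/* punpcklbw-style interleave (hypothetical name): a0 b0 a1 b1 a2 b2 a3 b3. */
void interleave_lo_s8x8(const int8_t a[8], const int8_t b[8], int8_t out[8]) {
#if defined(__ARM_NEON)
    /* vzip_s8 returns both halves; val[0] is the low interleave,
     * val[1] would be the punpckhbw-style high interleave. */
    int8x8x2_t z = vzip_s8(vld1_s8(a), vld1_s8(b));
    vst1_s8(out, z.val[0]);
#else
    for (int i = 0; i < 4; i++) {
        out[2 * i] = a[i];
        out[2 * i + 1] = b[i];
    }
#endif
}
```

In real code you would keep values in NEON registers (int16x4_t, int8x8_t) across operations rather than loading and storing arrays around every call; the array interface here just keeps the example short. If the goal is to port existing code with minimal changes, SIMDe's implementations of the Intel intrinsics are likely less work than hand-mapping each builtin.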
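
For the plain element-wise operations in the question's list (paddb, paddw, paddd, and building or reading a two-element vector), the GNU C native vectors mentioned above may be enough on their own. The following is a rough sketch, not a drop-in replacement: the type and function names (v8u8, v4u16, v2u32, add_u8x8, and so on) are hypothetical, and it uses 8-byte vectors on the assumption that the builtins in question are the 64-bit forms. Unsigned element types are used so that wraparound on overflow is well defined, matching the wrapping behaviour of the padd instructions.

```c
#include <stdint.h>

/* 8-byte GNU C vector types (names are hypothetical). GCC and Clang
 * compile arithmetic on these to whatever SIMD the target offers. */
typedef uint8_t  v8u8  __attribute__((vector_size(8)));
typedef uint16_t v4u16 __attribute__((vector_size(8)));
typedef uint32_t v2u32 __attribute__((vector_size(8)));

/* paddb / paddw / paddd equivalents: element-wise add with wraparound. */
static inline v8u8  add_u8x8(v8u8 a, v8u8 b)     { return a + b; }
static inline v4u16 add_u16x4(v4u16 a, v4u16 b)  { return a + b; }
static inline v2u32 add_u32x2(v2u32 a, v2u32 b)  { return a + b; }

/* Rough counterpart of vec_init_v2si: build a vector from two scalars. */
static inline v2u32 make_u32x2(uint32_t lo, uint32_t hi) {
    v2u32 v = {lo, hi};
    return v;
}

/* Rough counterpart of vec_ext_v2si: read one element by index (0 or 1). */
static inline uint32_t get_u32x2(v2u32 v, int index) {
    return v[index];
}
```

For example, add_u8x8 applied to bytes 250 and 10 gives 4 (260 modulo 256). The saturating packs and the interleaves have no direct operator form in this style, which is where the ISA-specific intrinsics or SIMDe come back in.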