Are there ARM64 equivalents for the x86-64 SSE2 integer SIMD GCC built-in functions?

Posted on 2025-01-15 03:49:24

I'm trying to run an AMM algorithm (approximate matrix multiplication) on Apple's M1. The algorithm is all about speed and uses the x86 built-in functions listed below. Since running it in an x86 VM slows down several crucial steps of the algorithm, I was wondering whether there is another way to run it on ARM64.

I also could not find suitable documentation for the ARM64 built-in functions, which might help with mapping some of the x86-64 builtins.

Used built-in functions:

__builtin_ia32_vec_init_v2si
__builtin_ia32_vec_ext_v2si
__builtin_ia32_packsswb
__builtin_ia32_packssdw
__builtin_ia32_packuswb
__builtin_ia32_punpckhbw
__builtin_ia32_punpckhwd
__builtin_ia32_punpckhdq
__builtin_ia32_punpcklbw
__builtin_ia32_punpcklwd
__builtin_ia32_punpckldq
__builtin_ia32_paddb
__builtin_ia32_paddw
__builtin_ia32_paddd

Comments (1)

阳光下慵懒的猫 2025-01-22 03:49:24

Normally you'd use intrinsics instead of the raw GCC builtin functions, but see https://gcc.gnu.org/onlinedocs/gcc/ARM-C-Language-Extensions-_0028ACLE_0029.html for what GCC documents. The __builtin_arm_... and __builtin_aarch64_... functions like __builtin_aarch64_saddl2v16qi don't seem to be documented in the GCC manual the way the x86 ones are, which is just another sign they're not intended for direct use.

See also https://developer.arm.com/documentation/102467/0100/Why-Neon-Intrinsics- regarding intrinsics and #include <arm_neon.h>. GCC provides a version of that header, with the documented intrinsics API implemented using __builtin_aarch64_... GCC builtins.
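To make the intrinsics route concrete, here is a minimal sketch of how two of the builtins from the question might map onto arm_neon.h, assuming the 64-bit operand forms that the GCC manual lists for them: paddb as vadd_u8, and punpcklbw / punpckhbw as vzip1_u8 / vzip2_u8. The helper names add_u8x8 and interleave_u8x8 are made up for illustration, and the scalar branch is only there so the snippet also builds on non-AArch64 hosts.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Wrapping byte-wise add of two 8-byte vectors (the paddb operation).
   add_u8x8 is a hypothetical helper name, not part of any library. */
static inline void add_u8x8(const uint8_t a[8], const uint8_t b[8],
                            uint8_t out[8]) {
#if defined(__aarch64__)
    vst1_u8(out, vadd_u8(vld1_u8(a), vld1_u8(b)));
#else
    /* Scalar fallback so the sketch compiles on other targets. */
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(a[i] + b[i]);
#endif
}

/* Interleave bytes of a and b.
   lo = a0 b0 a1 b1 a2 b2 a3 b3   (the punpcklbw operation)
   hi = a4 b4 a5 b5 a6 b6 a7 b7   (the punpckhbw operation)
   The output buffers must not overlap the inputs. */
static inline void interleave_u8x8(const uint8_t a[8], const uint8_t b[8],
                                   uint8_t lo[8], uint8_t hi[8]) {
#if defined(__aarch64__)
    uint8x8_t va = vld1_u8(a);
    uint8x8_t vb = vld1_u8(b);
    vst1_u8(lo, vzip1_u8(va, vb)); /* ZIP1 */
    vst1_u8(hi, vzip2_u8(va, vb)); /* ZIP2 */
#else
    for (int i = 0; i < 4; i++) {
        lo[2 * i]     = a[i];
        lo[2 * i + 1] = b[i];
        hi[2 * i]     = a[i + 4];
        hi[2 * i + 1] = b[i + 4];
    }
#endif
}
```

The 16- and 32-bit variants (punpcklwd, punpckldq, paddw, paddd) should follow the same pattern with the _u16 / _u32 forms of the same intrinsics. In real code you would keep values in uint8x8_t registers across operations rather than loading and storing around every call; the array interface here is just to keep the sketch target-independent.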


As for portability libraries, AFAIK none work from the raw builtins, but SIMDe (https://github.com/simd-everywhere/simde) has portable implementations of the immintrin.h Intel intrinsics like _mm_packs_epi16. Most code should be using that intrinsics API instead of GNU C builtins, unless you're using GNU C native vectors (__attribute__((vector_size(16)))) for portable SIMD without any ISA-specific stuff. But that's not viable when you want to take advantage of special shuffles and similar ISA-specific operations.
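For the simple cases, a minimal sketch of the GNU C native vector approach could look like the following. It assumes GCC or clang, uses 8-byte vectors to match the 64-bit operands of the builtins in the question, and covers only the element-wise add plus building a vector and extracting a lane (roughly what paddb, vec_init_v2si and vec_ext_v2si do). The type and function names are hypothetical.

```c
#include <stdint.h>

/* 8 x uint8 and 2 x int32 packed into 64-bit vectors (GNU C extension). */
typedef uint8_t v8u8  __attribute__((vector_size(8)));
typedef int32_t v2i32 __attribute__((vector_size(8)));

/* Element-wise add. Unsigned lanes are used so wrap-around is well defined;
   the compiler is free to pick a suitable SIMD add for the target. */
static inline v8u8 add_bytes(v8u8 a, v8u8 b) {
    return a + b;
}

/* Build a two-lane vector from scalars. */
static inline v2i32 make_v2i32(int32_t lane0, int32_t lane1) {
    v2i32 v = {lane0, lane1};
    return v;
}

/* Read one lane back out; lane must be 0 or 1. */
static inline int32_t get_lane_v2i32(v2i32 v, int lane) {
    return v[lane];
}
```

The same source compiles for x86-64 and AArch64 without any ISA-specific header. The saturating packs and the interleaves are the part that does not fit this style well, which is where intrinsics or SIMDe come in.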

And yes, ARM does have narrowing with saturation, with instructions like vqmovn (https://developer.arm.com/documentation/dui0473/m/neon-instructions/vqmovn-and-vqmovun), so SIMDe can efficiently emulate the pack instructions. That page documents AArch32, not AArch64, but AArch64 has equivalent instructions (SQXTN and SQXTUN), reachable through the same vqmovn / vqmovun intrinsics.
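As a rough illustration of the saturating-narrow idea on AArch64, the sketch below emulates the 64-bit packsswb and packuswb operations with vqmovn_s16 and vqmovun_s16 from arm_neon.h. The helper names pack_s16_to_s8_sat and pack_s16_to_u8_sat are invented for this example, and the scalar branch exists only so the snippet builds on other targets; it is not how SIMDe itself is written.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* packsswb-style: narrow 4 + 4 int16 lanes to 8 int8 lanes with signed
   saturation. Lanes 0..3 come from a, lanes 4..7 from b. */
static inline void pack_s16_to_s8_sat(const int16_t a[4], const int16_t b[4],
                                      int8_t out[8]) {
#if defined(__aarch64__)
    int16x8_t wide = vcombine_s16(vld1_s16(a), vld1_s16(b));
    vst1_s8(out, vqmovn_s16(wide));  /* SQXTN */
#else
    for (int i = 0; i < 8; i++) {
        int16_t x = (i < 4) ? a[i] : b[i - 4];
        out[i] = (int8_t)(x > INT8_MAX ? INT8_MAX
                                       : (x < INT8_MIN ? INT8_MIN : x));
    }
#endif
}

/* packuswb-style: narrow signed int16 lanes to uint8 lanes, clamping
   negative values to 0 and values above 255 to 255. */
static inline void pack_s16_to_u8_sat(const int16_t a[4], const int16_t b[4],
                                      uint8_t out[8]) {
#if defined(__aarch64__)
    int16x8_t wide = vcombine_s16(vld1_s16(a), vld1_s16(b));
    vst1_u8(out, vqmovun_s16(wide)); /* SQXTUN */
#else
    for (int i = 0; i < 8; i++) {
        int16_t x = (i < 4) ? a[i] : b[i - 4];
        out[i] = (uint8_t)(x > UINT8_MAX ? UINT8_MAX : (x < 0 ? 0 : x));
    }
#endif
}
```

packssdw would presumably follow the same shape with vcombine_s32 and vqmovn_s32, narrowing 2 + 2 int32 lanes to 4 int16 lanes.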
