Effect of the Architecture When Using SSE / AVX Intrinsics

Posted on 2025-01-29 02:22:19

I wonder how a compiler treats intrinsics.

If one uses SSE2 intrinsics (using #include <emmintrin.h>) and compiles with the -mavx flag, what will the compiler generate? Will it generate AVX or SSE code?

If one uses AVX2 intrinsics (using #include <immintrin.h>) and compiles with the -msse2 flag, what will the compiler generate? Will it generate SSE-only or AVX code?

How do compilers treat intrinsics?
If one uses intrinsics, does it help the compiler understand the dependencies in the loop for better vectorization?

For instance, what's going on here - https://godbolt.org/z/Y4J5OA (Or https://godbolt.org/z/LZOJ2K)?
See all 3 panes.

The Context

I'm trying to build various versions of the same function with different CPU features (SSE4 and AVX2).
I'm writing the same function twice, once with SSE intrinsics and once with AVX intrinsics.
Let's say they are named MyFunSSE() and MyFunAVX(). Both are in the same file.

How can I make the compiler (the same method should work for MSVC, GCC and ICC) build each of them using only the respective instruction set?

Comments (2)

一城柳絮吹成雪 2025-02-05 02:22:19

GCC and clang require that you enable all extensions you use. Otherwise it's a compile-time error, like error: inlining failed in call to always_inline ‘__m256d _mm256_mask_loadu_pd(__m256d, __mmask8, const void*)’: target specific option mismatch

Using -march=native or -march=haswell or whatever is preferred over enabling specific extensions, because that also sets appropriate tuning options. And you won't forget useful ones like -mpopcnt, which lets std::bitset::count() inline a popcnt instruction, or -mbmi2, which makes all variable-count shifts more efficient with shlx / shrx (1 uop vs. 3).


MSVC and ICC do not require this: they let you use intrinsics to emit instructions that they would not be allowed to use when auto-vectorizing.

You should definitely enable AVX if you use AVX intrinsics. Older MSVC without enabling AVX didn't always use vzeroupper automatically where needed, but that's been fixed for a few years. Still, if your whole program can assume AVX support, definitely tell the compiler about it even for MSVC.


For compilers that support GNU extensions (GCC, clang, ICC), you can use stuff like __attribute__((target("avx"))) on specific functions in a compilation unit. Or better, __attribute__((target("arch=haswell"))) to maybe also set tuning options. (That also enables AVX2 and FMA, which you might not want. And I'm not sure if target attributes do set -mtune=xx). See
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html

A function marked with __attribute__((target(...))) will not inline into functions with other target options, so if the function itself is small, be careful to put the attribute on the function it should inline into. Use it on a function containing a loop, not on a helper function called in a loop.
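As a minimal sketch of that advice, assuming GCC or clang on x86-64: the target attribute goes on the function that contains the loop, and a plain wrapper picks a version at run time. The names add_arrays, add_arrays_avx and add_arrays_sse are hypothetical; __builtin_cpu_supports is the GCC/clang builtin for the runtime check, and none of this applies to MSVC.

```c
#include <stddef.h>     /* size_t */
#include <immintrin.h>  /* SSE and AVX intrinsics */

/* AVX version: the attribute is on the function that holds the loop,
   so the _mm256 intrinsics can inline into it even without -mavx. */
__attribute__((target("avx")))
static void add_arrays_avx(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)          /* scalar tail */
        dst[i] = a[i] + b[i];
}

/* SSE version: built with the file's default options
   (SSE/SSE2 is baseline on x86-64). */
static void add_arrays_sse(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)          /* scalar tail */
        dst[i] = a[i] + b[i];
}

/* Hypothetical dispatcher: checks the CPU once per call and forwards. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    if (__builtin_cpu_supports("avx"))
        add_arrays_avx(dst, a, b, n);
    else
        add_arrays_sse(dst, a, b, n);
}
```

Because the two versions have different target options, the compiler will not inline add_arrays_avx into add_arrays, which is what you want here: the call boundary is where the CPU check happens.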

See also
https://gcc.gnu.org/wiki/FunctionMultiVersioning for using different target options on multiple definitions of the same function name, for compiler supported runtime dispatching. But I don't think there's a portable (to MSVC) way to do that.

See "specify simd level of a function that compiler can use" for more about doing runtime dispatch on GCC/clang.


With MSVC you don't need anything, although like I said I think it's normally a bad idea to use AVX intrinsics without -arch:AVX, so you might be better off putting those in a separate file. But for AVX vs. AVX2 + FMA, or SSE2 vs. SSE4.2, you're fine without anything.

Just #define AVX2_FUNCTION to the empty string instead of __attribute__((target("avx2,fma")))

#include <stddef.h>     // size_t
#include <immintrin.h>  // __m256, _mm256_loadu_ps

#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
// apparently ICC doesn't support target attributes, despite supporting GNU C
#define TARGET_HASWELL __attribute__((target("arch=haswell")))
#else
#define TARGET_HASWELL   // empty
 // maybe warn if __AVX__ isn't defined for functions where this is used?
 // if you need to make sure MSVC uses vzeroupper everywhere needed.
#endif


TARGET_HASWELL
void foo_avx(float *__restrict dst, float *__restrict src)
{
   for (size_t i = 0 ; i<1024 ; i++) {
       __m256 v = _mm256_loadu_ps(src);
       ...
       ...
   }
}

With GCC and clang, the macro expands to the __attribute__((target)) stuff; with MSVC and ICC it doesn't.


ICC pragma:

https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-optimization-parameter documents a pragma which you'd want to put before AVX functions to make sure vzeroupper is used properly in functions that use _mm256 intrinsics.

#pragma intel optimization_parameter target_arch=AVX

For ICC, you could #define TARGET_AVX as this, and always use it on a line by itself before the function, where you can put an __attribute__ or a pragma. You might also want separate macros for defining vs. declaring functions, if ICC doesn't want this on declarations. And a macro to end a block of AVX functions, if you want to have non-AVX functions after them. (For non-ICC compilers, this would be empty.)

猫七 2025-02-05 02:22:19

If you compile code with -mavx2 enabled, your compiler will (usually) generate so-called "VEX-encoded" instructions. In the case of _mm_loadu_ps, this generates vmovups instead of movups. The two are almost equivalent, except that movups only modifies the lower 128 bits of the target register, whereas vmovups zeroes everything above the lower 128 bits. However, the VEX-encoded version will only run on machines which support at least AVX. Details on [v]movups are here.

For other instructions like [v]addps, AVX has the additional advantage of allowing three operands (i.e., the target can be different from both sources), which in some cases can avoid copying registers. E.g.,

_mm_mul_ps(_mm_add_ps(a,b), _mm_sub_ps(a,b));

requires a register copy (movaps) when compiled for SSE, but not when compiled for AVX:
https://godbolt.org/z/YHN5OA
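To try this yourself, here is a small sketch of that expression wrapped in a function, assuming an x86 target where SSE is available. The names diff_of_squares and diff_of_squares4 are made up for illustration. Compiling the same source once with -msse2 and once with -mavx and comparing the assembly should show the movaps difference described above; the numeric result is the same either way.

```c
#include <xmmintrin.h>  /* SSE intrinsics (__m128, _mm_*_ps) */

/* (a + b) * (a - b) on four packed floats.
   Both a and b are needed twice, so two-operand SSE encodings have to
   copy one of them; three-operand VEX encodings do not. */
static __m128 diff_of_squares(__m128 a, __m128 b)
{
    return _mm_mul_ps(_mm_add_ps(a, b), _mm_sub_ps(a, b));
}

/* Convenience wrapper working on plain float arrays of length 4. */
void diff_of_squares4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, diff_of_squares(va, vb));
}
```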


Regarding using AVX intrinsics but compiling without AVX: compilers either fail (like gcc/clang) or silently generate the corresponding instructions, which would then fault on machines without AVX support (see @PeterCordes's answer for details on that).


Addendum: If you want to implement different functions depending on the architecture (at compile-time) you can check that using #ifdef __AVX__ or #if defined(__AVX__): https://godbolt.org/z/ZVAo-7
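A rough sketch of such compile-time selection follows. The function name sum8 is hypothetical, and the __SSE2__ branch assumes GCC/clang-style predefined macros (as far as I know MSVC defines __AVX__ under /arch:AVX but not __SSE2__). Exactly one of the three bodies ends up in the binary, depending on the flags the file is compiled with.

```c
#if defined(__AVX__)
#include <immintrin.h>
/* AVX build: one 256-bit load, then a horizontal add. */
static float sum8(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);              /* 4 partial sums */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));      /* 2 partial sums */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));  /* total in lane 0 */
    return _mm_cvtss_f32(s);
}
#elif defined(__SSE2__)
#include <emmintrin.h>
/* SSE build: two 128-bit loads, same horizontal add. */
static float sum8(const float *p)
{
    __m128 s = _mm_add_ps(_mm_loadu_ps(p), _mm_loadu_ps(p + 4));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    return _mm_cvtss_f32(s);
}
#else
/* Fallback when neither macro is defined. */
static float sum8(const float *p)
{
    float s = 0.0f;
    for (int i = 0; i < 8; i++)
        s += p[i];
    return s;
}
#endif
```

Note that this picks the implementation when the file is compiled, not when the program runs, so a binary built with -mavx will still fault on a CPU without AVX.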

Implementing them in the same compilation unit is difficult, I think. The easiest solutions are to build different shared libraries or even different binaries, and have a small binary which detects the available CPU features and loads the corresponding library/binary. I assume there are related questions on that topic.
