Using multiple pragmas on the same loop for auto-vectorization on GCC and ICC

Posted on 2025-01-24 12:07:43

When there is a simple loop running on simple arrays,

for(int i=0;i<16;i++)
{
       a[i]=b[i]+c[i];
}

GCC and ICC behave differently with pragmas. So I experimented with pragmas and observed that ICC benefits from this:

#pragma vector always vectorlength(16)
for(int i=0;i<16;i++)
{
       a[i]=b[i]+c[i];
}

and GCC benefits from this:

#pragma GCC ivdep
for(int i=0;i<16;i++)
{
       a[i]=b[i]+c[i];
}

What is the right approach to support both compilers? Something like this:

#pragma vector always vectorlength(16)
#pragma GCC ivdep
for(int i=0;i<16;i++)
{
       a[i]=b[i]+c[i];
}

Or should I use #define macros? (I'm not fond of macros, but I can use them if there is no other option.)
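For reference, the macro option could look roughly like the sketch below. It is only an illustration under some assumptions: the macro name VEC_LOOP_HINT and the function add16 are hypothetical, the element type float is a guess (the post does not state it), and the ICC pragma string is copied from the question as written. ICC also defines __GNUC__, so __INTEL_COMPILER is tested first.

```cpp
// Hypothetical helper macro (not from the original post): expands to the
// loop hint that the current compiler understands, or to nothing.
#if defined(__INTEL_COMPILER)
  // Classic ICC: the pragma text is taken verbatim from the question.
  #define VEC_LOOP_HINT _Pragma("vector always vectorlength(16)")
#elif defined(__GNUC__) && !defined(__clang__)
  // GCC: the pragma namespace is spelled in upper case.
  #define VEC_LOOP_HINT _Pragma("GCC ivdep")
#else
  // Any other compiler: no hint, the loop is still correct.
  #define VEC_LOOP_HINT
#endif

// Assumption: float elements; the post only says "simple arrays".
static inline void add16(float* a, const float* b, const float* c)
{
    VEC_LOOP_HINT
    for (int i = 0; i < 16; i++)
    {
        a[i] = b[i] + c[i];
    }
}
```

The _Pragma operator is used because a #pragma directive cannot appear inside a #define. Each compiler then sees only its own hint, which avoids unknown-pragma warnings from stacking both pragmas on the same loop.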

I'm trying to support #pragma omp simd safelen(16) on platforms that do not have OpenMP. The closest pragmas I have found are GCC ivdep and vector always, but they are still not as fast as the OpenMP pragma. I'm probably missing some other pragmas.
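For comparison, the OpenMP variant being measured against might be written as in the sketch below. The function name add16_omp_simd and the float element type are assumptions. One possibility worth checking, not confirmed by the post: GCC's -fopenmp-simd and ICC's -qopenmp-simd enable only the simd directives without linking the OpenMP runtime. If neither flag is given, the pragma is simply ignored and the loop remains correct.

```cpp
// Hypothetical function name; assumes float elements.
// Intended to be built with -fopenmp-simd (GCC) or -qopenmp-simd (ICC),
// which honor "omp simd" without requiring the OpenMP runtime library.
static inline void add16_omp_simd(float* a, const float* b, const float* c)
{
    #pragma omp simd safelen(16)
    for (int i = 0; i < 16; i++)
    {
        a[i] = b[i] + c[i];
    }
}
```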

a, b and c are simple arrays on the same stack, and they are aligned to 64 bytes.
The function has __attribute__((always_inline)), which gives ICC a 4x speedup (but it is still 50% slower than GCC).
ICC flags: -std=c++14 -xCORE-AVX512 -qopt-zmm-usage=high -O3 -lgomp -fmath-errno -mprefer-vector-width=512 -ftree-vectorize -lpthread
GCC flags: -std=c++14 -march=cascadelake -fmath-errno -mavx512f -O3 -lgomp -mprefer-vector-width=512 -ftree-vectorize -lpthread

Lastly, why is there no #pragma vector always equivalent for GCC?
