gcc memory alignment pragma
Does gcc have a memory alignment pragma, akin to #pragma vector aligned in the Intel compiler?
I would like to tell the compiler to optimize a particular loop using aligned load/store instructions. To avoid possible confusion, this is not about struct packing.
For example:
#if defined (__INTEL_COMPILER)
#pragma vector aligned
#endif
for (int a = 0; a < int(N); ++a) {
    q10 += Ix(a,0,0)*Iy(a,1,1)*Iz(a,0,0);
    q11 += Ix(a,0,0)*Iy(a,0,1)*Iz(a,1,0);
    q12 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,0,1);
    q13 += Ix(a,1,0)*Iy(a,0,0)*Iz(a,0,1);
    q14 += Ix(a,0,0)*Iy(a,1,0)*Iz(a,0,1);
    q15 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,1,1);
}
Thanks
Comments (3)
You can tell GCC that a pointer points to aligned memory by using a typedef to create an over-aligned type that you can declare pointers to.
This helps gcc but not clang 7.0 or ICC 19; compare the x86-64 non-AVX asm they emit on Godbolt. (Only GCC folds a load into a memory operand for mulps instead of using a separate movups.) You have to use __builtin_assume_aligned if you want to portably convey an alignment promise to GNU C compilers other than GCC itself.
See http://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html for the type attribute documentation.
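A minimal sketch of the over-aligned typedef approach described above. The kernel name scaled_add and its arithmetic are hypothetical, chosen only for illustration; the 16-byte alignment is an assumption that suits SSE-width vectors. The asserts check the alignment promise in debug builds, since passing a misaligned pointer would break the promise made to the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Over-aligned element type (GNU C extension). sizeof(aligned_double) is
// still sizeof(double); only the alignment GCC may assume for pointers
// to this type changes.
typedef double aligned_double __attribute__((aligned(16)));

// Hypothetical kernel: y[i] += a * x[i]. Both pointers are promised to be
// 16-byte aligned through their type.
void scaled_add(aligned_double *x, aligned_double *y, double a, std::size_t n) {
    // Debug-only check that the caller kept the alignment promise.
    assert(reinterpret_cast<std::uintptr_t>(x) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(y) % 16 == 0);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

The caller is responsible for allocating the buffers with at least 16-byte alignment, for example with alignas(16) on an array. Whether the loop is actually vectorized with aligned loads depends on the GCC version and flags.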
This won't make aligned_double 16 bytes wide. It only aligns it to a 16-byte boundary, or rather the first element of an array will be. Looking at the disassembly on my computer, as soon as I use the alignment directive, I start to see a lot of vector ops. I am using a Power architecture computer at the moment so it's AltiVec code, but I think this does what you want.
(Note: I wasn't using double when I tested this, because AltiVec doesn't support double-precision floats.)
You can see some other examples of autovectorization using the type attributes here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
I tried your solution with g++ version 4.5.2 (both Ubuntu and Windows) and it did not vectorize the loop.
If the alignment attribute is removed then it vectorizes the loop, using unaligned loads.
If the function is inlined so that the array can be accessed directly with the pointer eliminated, then it is vectorized with aligned loads.
In both cases, the alignment attribute prevents vectorization. This is ironic: The "aligned_double *x" was supposed to enable vectorization but it does the opposite.
Which compiler was it that reported vectorized loops for you? I suspect it was not a gcc compiler?
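The "accessed directly" case mentioned above can be sketched as follows, assuming hypothetical file-scope buffers xs and ys and a hypothetical kernel axpy_direct. Because the compiler sees the array declarations themselves, it knows their alignment without any pointer annotation. The flags in the comment are one way to ask GCC for a vectorization report; the outcome will vary by version and target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size buffers, aligned at their declaration.
const std::size_t kN = 1024;
static double xs[kN] __attribute__((aligned(16)));
static double ys[kN] __attribute__((aligned(16)));

// The loop indexes the arrays directly, so no pointer parameter hides
// their alignment from the compiler.
// To see what the vectorizer did, compile with for example:
//   g++ -O3 -fopt-info-vec file.cpp              (GCC 4.9 and later)
//   g++ -O3 -ftree-vectorizer-verbose=2 file.cpp (older releases)
void axpy_direct(double a) {
    // Sanity check that the declared alignment was honored.
    assert(reinterpret_cast<std::uintptr_t>(&xs[0]) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(&ys[0]) % 16 == 0);
    for (std::size_t i = 0; i < kN; ++i) {
        ys[i] += a * xs[i];
    }
}
```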
It looks like newer versions of GCC have __builtin_assume_aligned. Based on some other questions and answers on Stack Overflow circa 2010, it appears the built-in was not available in GCC 3 and early GCC 4, but I do not know where the cut-off point is.
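A minimal sketch of how __builtin_assume_aligned might be used on the pointers of a loop. The kernel name scaled_add_assume and the 16-byte figure are assumptions for illustration. As far as I know, the GCC 4.7 release notes list the built-in as new in that release, and clang accepts it as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical kernel: y[i] += a * x[i], with the alignment promise made
// inside the function rather than through the pointer type.
void scaled_add_assume(const double *x_raw, double *y_raw, double a, std::size_t n) {
    // Debug-only check; lying to the built-in is undefined behavior.
    assert(reinterpret_cast<std::uintptr_t>(x_raw) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(y_raw) % 16 == 0);

    // The built-in returns its first argument as void *; the compiler may
    // assume the returned pointer is aligned to at least 16 bytes.
    const double *x = static_cast<const double *>(__builtin_assume_aligned(x_raw, 16));
    double *y = static_cast<double *>(__builtin_assume_aligned(y_raw, 16));

    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

The loop must use the returned pointers x and y, not the original parameters, for the alignment hint to apply.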