Loop versioning with GCC

Posted on 2024-08-11 08:22:33

I am working on auto-vectorization with GCC. Because of a customer requirement, I cannot use intrinsics or attributes (that is, I cannot rely on user annotations to help vectorization).

If the alignment of an array in a vectorizable loop is unknown, GCC applies 'loop versioning'. Loop versioning is performed as part of loop vectorization on trees. When a loop is identified as vectorizable but a data-alignment or data-dependence constraint that cannot be resolved at compile time stands in the way, two versions of the loop are generated: a vectorized one and a non-vectorized one, together with runtime checks for alignment or dependence that select which version executes.

My question is: how can we enforce the alignment? If I have found a loop that is vectorizable, I do not want two versions of it to be generated just because alignment information is missing.

For example, consider the code below:
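As a rough illustration of what versioning amounts to, the following is a hand-written C analogue, not GCC's actual output. The names is_aligned and copy_versioned are hypothetical, and the 16-byte boundary is an assumption (the width of an SSE vector).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if p sits on an `align`-byte boundary.
   `align` must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hand-written analogue of a versioned loop.
   Returns 1 if the aligned version ran, 0 if the fallback ran. */
int copy_versioned(short *dst, const short *src, size_t n)
{
    size_t i;

    if (is_aligned(dst, 16) && is_aligned(src, 16)) {
        /* The version a vectorizer could implement with aligned vector moves. */
        for (i = 0; i < n; i++)
            dst[i] = src[i];
        return 1;
    }

    /* Scalar fallback, taken when the runtime check fails. */
    for (i = 0; i < n; i++)
        dst[i] = src[i];
    return 0;
}
```

Both branches compute the same result. The runtime test only decides which copy of the loop body executes, which is the role statement (A) plays in the tree dump further down.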

short a[15]; short b[15]; short c[15];
int i;

void foo()
{
    for (i=0; i<15; i++)
    {
      a[i] = b[i] ;
    }
}

Tree dump (options: -fdump-tree-optimized -ftree-vectorize)

<SNIP>
     vector short int * vect_pa.49;
     vector short int * vect_pb.42;
     vector short int * vect_pa.35;
     vector short int * vect_pb.30;

    <bb 2>:
     vect_pb.30 = (vector short int *) &b;
     vect_pa.35 = (vector short int *) &a;
     if (((signed char) vect_pa.35 | (signed char) vect_pb.30) & 3 == 0)    ;; <== (A)
       goto <bb 3>;
     else
       goto <bb 4>;

    <bb 3>:
</SNIP>

The vectorized version of the loop is generated at 'bb 3' and the non-vectorized version at 'bb 4'; the alignment check in statement (A) chooses between them. Without using intrinsics or attributes, how can I get only the vectorized code, without this runtime alignment check?

Comments (2)

混吃等死 2024-08-18 08:22:33

If the data in question is allocated statically, you can use GCC's aligned attribute, written __attribute__((aligned(N))), to specify that it should be aligned to the necessary boundary. If you are dynamically allocating these arrays, you can over-allocate by the alignment value and then bump the returned pointer up to the alignment you need.
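A minimal sketch of the static case follows. It only applies if the no-attributes constraint from the question can be relaxed. GCC's documented spelling is __attribute__((aligned(N))). The value 16 is an assumption (the SSE vector width), and misalignment16 and check_static_alignment are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* GCC-specific: request 16-byte alignment for statically allocated arrays.
   16 is an assumption; use the vector size of your target. */
short a[15] __attribute__((aligned(16)));
short b[15] __attribute__((aligned(16)));

/* Offset of p from the previous 16-byte boundary (0 means aligned). */
unsigned misalignment16(const void *p)
{
    return (unsigned)((uintptr_t)p & 15u);
}

/* Sanity check that the attribute took effect. */
void check_static_alignment(void)
{
    assert(misalignment16(a) == 0);
    assert(misalignment16(b) == 0);
}
```

With the alignment visible in the declarations, the compiler knows it at compile time and has no need to test it at runtime.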

You can also use the posix_memalign() function if you're on a system that supports it. Finally, note that malloc() always returns memory suitably aligned for any built-in type, which in practice means at least 8 bytes (the alignment of a double) and often 16 bytes on 64-bit systems. If you don't need better than that, then malloc should suffice.
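For the posix_memalign() route, a sketch might look like the following. The wrapper name alloc_shorts_aligned16 and the 16-byte boundary are assumptions made for illustration; posix_memalign() itself is the standard POSIX call.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n shorts on a 16-byte boundary.
   Returns NULL on failure; release the result with free(). */
short *alloc_shorts_aligned16(size_t n)
{
    void *p = NULL;

    /* The alignment must be a power of two and a multiple of sizeof(void *). */
    if (posix_memalign(&p, 16, n * sizeof(short)) != 0)
        return NULL;

    assert(((uintptr_t)p & 15u) == 0);
    return p;
}
```

Note that this guarantees the alignment at run time only. The compiler cannot see the alignment through the returned pointer, so this by itself makes the runtime check pass rather than removing it.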

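Edit: If you modify your allocation code to force that check to be true (i.e. over-allocate, as suggested above), the vectorized version is the one that will run, and the compiler may be able to drop the conditional when it can see the alignment at compile time. If you need alignment to an 8-byte boundary, as it seems, that would be something like a = (a + 7) & ~7; with a treated as an integer address (for example via uintptr_t), since C does not allow bitwise operators on pointers.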
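The over-allocate-and-bump idea can be sketched as below. The names align_up and malloc_aligned are hypothetical. Rounding up to a power-of-two boundary N uses the mask ~(N - 1), so an 8-byte boundary needs (addr + 7) & ~7, and the arithmetic is done on uintptr_t rather than on the pointer itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + (align - 1)) & ~(align - 1);
}

/* Hypothetical helper: over-allocate by align-1 bytes, then bump the pointer.
   *raw receives the original pointer, which is the one to pass to free(). */
void *malloc_aligned(size_t size, size_t align, void **raw)
{
    *raw = malloc(size + align - 1);
    if (*raw == NULL)
        return NULL;
    return (void *)align_up((uintptr_t)*raw, (uintptr_t)align);
}
```

The caller has to keep the raw pointer around, because passing the bumped pointer to free() is undefined behaviour.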

我的痛♀有谁懂 2024-08-18 08:22:33

I get only one version of the loop, using your exact code with these options: gcc -march=core2 -c -O2 -fdump-tree-optimized -ftree-vectorize vec.c

My version of GCC is gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu8).

GCC is doing something clever here. It forces the arrays a and b to be 16-byte aligned. It doesn't do that to c, presumably because c is never used in a vectorizable loop.
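One way to observe this on your own toolchain is to probe the addresses at run time. The sketch below reuses the code from the question and adds a hypothetical helper, offset_in_16. What it reports depends on the compiler version, flags and target, so no particular output is claimed here.

```c
#include <assert.h>
#include <stdint.h>

short a[15]; short b[15]; short c[15];
int i;

void foo(void)
{
    for (i = 0; i < 15; i++)
        a[i] = b[i];
}

/* Hypothetical probe: distance of p from the previous 16-byte boundary. */
unsigned offset_in_16(const void *p)
{
    unsigned off = (unsigned)((uintptr_t)p & 15u);
    assert(off < 16u);
    return off;
}
```

Printing offset_in_16(a), offset_in_16(b) and offset_in_16(c) from a small main() shows whether a given build placed each array on a 16-byte boundary.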
