Why doesn't GCC auto-vectorize this loop?

Posted on 2024-10-17 18:27:55

I have the following C program (a simplification of my actual use case which exhibits the same behavior)

#include <stdlib.h>
#include <math.h>
int main(int argc, char ** argv) {
    const float * __restrict__ const input = malloc(20000*sizeof(float));
    float * __restrict__ const output = malloc(20000*sizeof(float));

    unsigned int pos=0;
    while(1) {
            unsigned int rest=100;
            for(unsigned int i=pos;i<pos+rest; i++) {
                    output[i] = input[i] * 0.1;
            }

            pos+=rest;            
            if(pos>10000) {
                    break;
            }
    }
}

When I compile with

 -O3 -g -Wall -ftree-vectorizer-verbose=5 -msse -msse2 -msse3 -march=native -mtune=native --std=c99 -fPIC -ffast-math

I get the output

main.c:10: note: not vectorized: unhandled data-ref

where 10 is the line of the inner for loop. When I looked up why it might say this, the explanations suggested that the pointers could be aliased, but they can't be in my code, as I use the __restrict__ keyword. The same sources also suggested including the -msse flags, but those don't seem to change anything either. Any help?

蔚蓝源自深海 2024-10-24 18:27:55

It certainly seems like a bug. In the following, equivalent functions, foo() is vectorised but bar() is not, when compiling for an x86-64 target:

void foo(const float * restrict input, float * restrict output)
{
    unsigned int pos;
    for (pos = 0; pos < 10100; pos++)
        output[pos] = input[pos] * 0.1;
}

void bar(const float * restrict input, float * restrict output)
{
    unsigned int pos;
    unsigned int i;
    for (pos = 0; pos <= 10000; pos += 100)
        for (i = 0; i < 100; i++)
            output[pos + i] = input[pos + i] * 0.1;
}

Adding the -m32 flag, to compile for an x86 target instead, causes both functions to be vectorised.
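The claim that the two functions are equivalent can be checked with a small harness. This is a minimal sketch: foo() and bar() are copied from above, while N and count_mismatches() are names introduced here for illustration only. It confirms that both loops write the same 10100 elements; it says nothing about whether a given GCC version vectorizes either one.

```c
#include <stdlib.h>

#define N 10100u  /* both functions touch indices 0 .. 10099 */

void foo(const float * restrict input, float * restrict output)
{
    unsigned int pos;
    for (pos = 0; pos < N; pos++)
        output[pos] = input[pos] * 0.1;
}

void bar(const float * restrict input, float * restrict output)
{
    unsigned int pos;
    unsigned int i;
    for (pos = 0; pos <= 10000; pos += 100)
        for (i = 0; i < 100; i++)
            output[pos + i] = input[pos + i] * 0.1;
}

/* Hypothetical helper (not from the answer): runs foo() and bar() on the
 * same input and returns how many output elements differ, or -1 if an
 * allocation fails. */
int count_mismatches(void)
{
    float *in = malloc(N * sizeof *in);
    float *out_foo = malloc(N * sizeof *out_foo);
    float *out_bar = malloc(N * sizeof *out_bar);
    int mismatches = 0;
    unsigned int k;

    if (!in || !out_foo || !out_bar) {
        free(in);
        free(out_foo);
        free(out_bar);
        return -1;
    }

    for (k = 0; k < N; k++)
        in[k] = (float)k;

    foo(in, out_foo);
    bar(in, out_bar);

    for (k = 0; k < N; k++)
        if (out_foo[k] != out_bar[k])
            mismatches++;

    free(in);
    free(out_foo);
    free(out_bar);
    return mismatches;
}
```

Calling count_mismatches() from a main() and compiling once for x86-64 and once with -m32 gives a quick sanity check that any difference in the vectorizer report is not caused by a difference in behaviour.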

維他命╮ 2024-10-24 18:27:55

It doesn't like the outer loop format, which prevents it from understanding the inner loop. I can get it to vectorize if I just fold everything into a single loop:

#include <stdlib.h>
#include <math.h>
int main(int argc, char ** argv) {
    const float * __restrict__ input = malloc(20000*sizeof(float));
    float * __restrict__ output = malloc(20000*sizeof(float));

    for(unsigned int i=0; i<10100; i++) {
            output[i] = input[i] * 0.1f;
    }
}

(Note: the original loops touch indices 0 through 10099, so the single-loop condition is i<10100.)

You might be able to take advantage of this by putting a simplified inner loop into a function which you call with pointers and a count. Even when it is inlined again it may work fine. This assumes your real while() loop contains parts that you removed for the question (and that I have simplified away here) but that you need to keep.
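The helper-function suggestion might look something like the sketch below. The names scale_block(), scale_all() and demo_last_element() are made up for illustration, and whether GCC actually vectorizes the helper once it is inlined depends on the compiler version and flags, so the vectorizer report still needs to be checked.

```c
/* Hypothetical helper: a plain counted loop over one block, indexed from 0. */
static void scale_block(const float * restrict in, float * restrict out,
                        unsigned int count)
{
    for (unsigned int i = 0; i < count; i++)
        out[i] = in[i] * 0.1f;
}

/* Hypothetical driver that keeps the outer while() structure from the
 * question. Both buffers must hold at least 10100 elements. */
void scale_all(const float *input, float *output)
{
    unsigned int pos = 0;
    while (1) {
        unsigned int rest = 100;
        /* Pass the block start and its length instead of indexing by pos + i. */
        scale_block(input + pos, output + pos, rest);

        pos += rest;
        if (pos > 10000) {
            break;
        }
    }
}

/* Usage example: fills the input with 0, 1, 2, ... and returns the last
 * element written, which should be close to 10099 * 0.1. */
float demo_last_element(void)
{
    static float in[10100];
    static float out[10100];

    for (unsigned int k = 0; k < 10100; k++)
        in[k] = (float)k;

    scale_all(in, out);
    return out[10099];
}
```

The point of the split is that the inner loop sees only a base pointer and a count, so its induction variable no longer depends on the outer loop's pos.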

原谅我要高飞 2024-10-24 18:27:55

Try:

const float * __restrict__ input = ...;
float * __restrict__ output = ...;

Experiment a bit by changing things around. This variant, compiled as C++, does vectorize:

#include <stdlib.h>
#include <math.h>

int main(int argc, char ** argv) {

    const float * __restrict__ input = new float[20000];
    float * __restrict__  output = new float[20000];

    unsigned int pos=0;
    while(1) {
        unsigned int rest=100;
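        output += pos;  // caution: cumulative, offsets go 0, 100, 300, 600, ... (not 0, 100, 200, ...)
        input += pos;   // same; after 20 passes both pointers are past the 20000-element arrays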
        for(unsigned int i=0;i<rest; ++i) {
            output[i] = input[i] * 0.1;
        }

        pos+=rest;
        if(pos>10000) {
            break;
        }
    }
}

g++ -O3 -g -Wall -ftree-vectorizer-verbose=7 -msse -msse2 -msse3 -c test.cpp

test.cpp:14: note: versioning for alias required: can't determine dependence between *D.4096_24 and *D.4095_21
test.cpp:14: note: mark for run-time aliasing test between *D.4096_24 and *D.4095_21
test.cpp:14: note: Alignment of access forced using versioning.
test.cpp:14: note: Vectorizing an unaligned access.
test.cpp:14: note: vect_model_load_cost: unaligned supported by hardware.
test.cpp:14: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 1 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
test.cpp:14: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
test.cpp:14: note: cost model: Adding cost of checks for loop versioning to treat misalignment.

test.cpp:14: note: cost model: Adding cost of checks for loop versioning aliasing.

test.cpp:14: note: Cost model analysis:
  Vector inside of loop cost: 8
  Vector outside of loop cost: 6
  Scalar iteration cost: 5
  Scalar outside cost: 1
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 2

test.cpp:14: note:   Profitability threshold = 3

test.cpp:14: note: Vectorization may not be profitable.
test.cpp:14: note: create runtime check for data references *D.4096_24 and *D.4095_21
test.cpp:14: note: created 1 versioning for alias checks.

test.cpp:14: note: LOOP VECTORIZED.
test.cpp:4: note: vectorized 1 loops in function.

Compilation finished at Wed Feb 16 19:17:59
