Why doesn't GCC auto-vectorize this loop?
I have the following C program (a simplification of my actual use case which exhibits the same behavior)
#include <stdlib.h>
#include <math.h>

int main(int argc, char ** argv) {
    const float * __restrict__ const input = malloc(20000*sizeof(float));
    float * __restrict__ const output = malloc(20000*sizeof(float));
    unsigned int pos=0;
    while(1) {
        unsigned int rest=100;
        for(unsigned int i=pos;i<pos+rest; i++) {
            output[i] = input[i] * 0.1;
        }
        pos+=rest;
        if(pos>10000) {
            break;
        }
    }
}
When I compile with
-O3 -g -Wall -ftree-vectorizer-verbose=5 -msse -msse2 -msse3 -march=native -mtune=native --std=c99 -fPIC -ffast-math
I get the output
main.c:10: note: not vectorized: unhandled data-ref
where 10 is the line of the inner for loop. When I looked up why it might say this, the explanations I found suggested that the pointers could be aliased, but they can't be in my code, as I have the __restrict__ keyword. They also suggested including the -msse flags, but those don't seem to do anything either. Any help?
Comments (3)
It certainly seems like a bug. In the following equivalent functions, foo() is vectorised but bar() is not when compiling for an x86-64 target:
Adding the -m32 flag, to compile for an x86 target instead, causes both functions to be vectorised.

It doesn't like the outer loop format, which is preventing it from understanding the inner loop. I can get it to vectorize if I just fold it into a single loop:
(Note that I didn't think too hard about how to properly translate the pos+rest limit into a single for loop condition, so it may be wrong.)
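A minimal sketch of one way the two loops could be folded into a single counted loop, not necessarily the code this answer used. The function name scale_all and the count parameter are hypothetical, and the constant is written as 0.1f to keep the arithmetic in single precision, which differs from the 0.1 in the question. Whether GCC vectorizes it still depends on the compiler version and flags.

```c
#include <stddef.h>

/* Hypothetical folded form: one loop with a plain upper bound,
 * instead of a while(1) wrapped around a for loop whose limit
 * is pos + rest. */
void scale_all(const float *restrict input, float *restrict output,
               size_t count)
{
    for (size_t i = 0; i < count; i++) {
        /* 0.1f (float) rather than 0.1 (double): an assumption made
         * here to avoid a float -> double -> float round trip. */
        output[i] = input[i] * 0.1f;
    }
}
```

The question's loop touches indices 0 through 10099 (the last pass starts at pos = 10000), so a call such as scale_all(input, output, 10100) would cover the same range.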
You might be able to take advantage of this by putting a simplified inner loop into a function which you call with pointers and a count. Even when it is inlined again it may work fine. This is assuming you deleted parts of your while() loop that I have just simplified away but you need to retain.
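The suggestion above can be sketched roughly as follows. This is only an illustration under stated assumptions: scale_block and scale_in_chunks are hypothetical names, the chunked driver stands in for whatever the real while() loop does between passes, and 0.1f replaces the question's 0.1 to stay in single precision. It is not guaranteed that a given GCC version vectorizes the helper.

```c
#include <stddef.h>

/* Hypothetical helper: the simplified inner loop, taking only
 * pointers and a count so the loop bound is a single variable. */
void scale_block(const float *restrict in, float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * 0.1f;  /* assumption: float constant */
    }
}

/* Hypothetical driver: walks `total` elements in chunks of at most
 * `chunk`, keeping the outer-loop structure outside the helper. */
void scale_in_chunks(const float *restrict in, float *restrict out,
                     size_t total, size_t chunk)
{
    if (chunk == 0) {
        return;  /* avoid looping forever on a zero step */
    }
    for (size_t pos = 0; pos < total; pos += chunk) {
        size_t left = total - pos;
        size_t rest = left < chunk ? left : chunk;
        scale_block(in + pos, out + pos, rest);
    }
}
```

With the question's sizes, scale_in_chunks(input, output, 10100, 100) would process the same elements in the same 100-element steps, and any per-chunk work from the real program could sit next to the scale_block call.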

Try:
Experiment a bit by changing things around: