SSE vectorization of the math 'pow' function with gcc
I am trying to vectorize a loop that calls the 'pow' function from the math library. I am aware that the Intel compiler supports 'pow' with SSE instructions, but I can't seem to get it to work with gcc (I think). This is the case I am working with:
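#include <math.h>  /* declares pow */

int main(){
    int i=0;
    float a[256],
          b[256];
    float x= 2.3;
    /* fill the input array */
    for (i =0 ; i<256; i++){
        a[i]=1.5;
    }
    /* loop 1: general power via the math library */
    for (i=0; i<256; i++){
        b[i]=pow(a[i],x);
    }
    /* loop 2: plain elementwise square */
    for (i=0; i<256; i++){
        b[i]=a[i]*a[i];
    }
    return 0;
}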
I'm compiling with the following:
gcc -O3 -Wall -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5 code.c -o runthis
This is on OS X 10.5.8 using gcc version 4.2 (I tried 4.5 as well and couldn't tell whether it had vectorized anything, as it didn't output anything at all). It appears that none of the loops vectorize. Is there an alignment issue, or some other issue for which I need to use restrict? If I write one of the loops as a function I get slightly more verbose output (code):
void pow2(float *a, float * b, int n) {
    int i;
    for (i=0; i<n; i++){
        b[i]=a[i]*a[i];
    }
}
output (using level 7 verbose output):
note: not vectorized: can't determine dependence between *D.2878_13 and *D.2877_8
bad data dependence.
I looked at the gcc auto-vectorization page but that didn't help much. If it is not possible to use pow with gcc, where could I find a resource for a pow-equivalent function (I'm mostly dealing with integer powers)?
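For integer powers specifically, one pow-free option is plain repeated multiplication, written so that the inner loop is a simple elementwise multiply. The following is only a sketch under assumptions: the name powi_array and its signature are hypothetical, the arrays are assumed not to overlap (hence __restrict), and whether gcc actually vectorizes the inner loop depends on the compiler version and flags.

```c
/* Hypothetical helper (not from any library): b[i] = a[i] raised to the
   non-negative integer power p, computed by repeated multiplication.
   Assumes a and b do not overlap, which __restrict promises the compiler. */
void powi_array(const float *__restrict a, float *__restrict b,
                int n, unsigned p)
{
    int i;
    unsigned k;

    /* start every output at 1, so p == 0 yields 1 */
    for (i = 0; i < n; i++) {
        b[i] = 1.0f;
    }

    /* multiply by the input p times; the inner loop is a simple
       elementwise multiply, the kind of loop a vectorizer can handle */
    for (k = 0; k < p; k++) {
        for (i = 0; i < n; i++) {
            b[i] *= a[i];
        }
    }
}
```

The cost grows linearly with p, so this only makes sense for small exponents; for large exponents a square-and-multiply scheme would need fewer passes.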
Edit: I was just digging into some other source. How did it vectorize this?!
void array_op(double * d,int len,double value,void (*f)(double*,double*) ) {
    for ( int i = 0; i < len; i++ ){
        f(&d[i],&value);
    }
}
The relevant gcc output:
note: Profitability threshold is 3 loop iterations.
note: LOOP VECTORIZED.
Well, now I'm at a loss: 'd' and 'value' are modified by a function that gcc doesn't know about. Strange? Maybe I need to test this portion a little more thoroughly to make sure the results are correct for the vectorized portion. I'm still looking for a vectorized math library. Why aren't there any open source ones?
Using __restrict or consuming inputs (assigning to local vars) before writing outputs should help.
As it is now, the compiler cannot vectorize because a might alias b, so doing 4 multiplies in parallel and writing back 4 values might not be correct.
(Note that __restrict won't guarantee that the compiler vectorizes, but as the code stands now, it certainly cannot.)
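As a minimal sketch of the __restrict suggestion, applied to the pow2 function from the question (the name pow2_restrict is hypothetical, and the input is additionally marked const as an assumption that it is never written):

```c
/* Hypothetical variant of pow2 from the question. __restrict is a promise
   from the caller that a and b never overlap, which removes the aliasing
   that blocks vectorization. It permits vectorization but does not force it. */
void pow2_restrict(const float *__restrict a, float *__restrict b, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        b[i] = a[i] * a[i];
    }
}
```

Calling this with overlapping arrays breaks the promise and is undefined behavior, so it is only safe when the caller can guarantee distinct buffers.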
This is not really an answer to your question, but rather a suggestion for how you might be able to avoid this issue entirely.
You mention that you're on OS X; there are already APIs on that platform that provide the operations you're looking at, without any need for auto-vectorization. Is there some reason that you aren't using them instead? Auto-vectorization is really cool, but it requires some work, and in general it doesn't produce results that are as good as using APIs that are already vectorized for you.
More generally, unless your algorithm is quite simple, auto-vectorization is unlikely to deliver great results. In my experience, the spectrum of vectorization techniques usually looks about like this: