SSE vectorization of the math 'pow' function with gcc

Posted on 2024-11-27 20:13:15

I was trying to vectorize a loop that uses the 'pow' function from the math library. I am aware that the Intel compiler can vectorize 'pow' with SSE instructions, but I can't seem to get this to work with gcc (I think). This is the case I am working with:

#include <math.h>   /* declares pow */

int main(){
        int i=0;
        float a[256], b[256];
        float x=2.3;

        for (i=0; i<256; i++){
                a[i]=1.5;
        }

        for (i=0; i<256; i++){
                b[i]=pow(a[i],x);
        }

        for (i=0; i<256; i++){
                b[i]=a[i]*a[i];
        }
        return 0;
}

I'm compiling with the following:

gcc -O3 -Wall -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5 code.c -o runthis

This is on OS X 10.5.8 using gcc version 4.2 (I tried 4.5 as well and couldn't tell whether it had vectorized anything, as it didn't output anything at all). It appears that none of the loops vectorize. Is there an alignment issue, or some other issue for which I need to use restrict? If I write one of the loops as a function, I get slightly more verbose output (code):

void pow2(float *a, float * b, int n) {
        int i;
        for (i=0; i<n; i++){
                b[i]=a[i]*a[i];
        }
}

Output (using verbosity level 7):

note: not vectorized: can't determine dependence between *D.2878_13 and *D.2877_8
bad data dependence.

I looked at the gcc auto-vectorization page, but that didn't help much. If it is not possible to vectorize pow with gcc, where could I find a resource for a pow-equivalent function (I'm mostly dealing with integer powers)?

Edit: I was just digging into some other source. How did it vectorize this?!

void array_op(double *d, int len, double value, void (*f)(double*, double*)) {
    for ( int i = 0; i < len; i++ ){
        f(&d[i],&value);
    }
}

The relevant gcc output:

note: Profitability threshold is 3 loop iterations.

note: LOOP VECTORIZED.

Well, now I'm at a loss: 'd' and 'value' are modified by a function that gcc doesn't know about. Strange? Maybe I need to test this portion a little more thoroughly to make sure the results of the vectorized portion are correct. I'm still looking for a vectorized math library. Why aren't there any open source ones?


Comments (2)

不打扰别人 2024-12-04 20:13:15

Using __restrict or consuming inputs (assigning to local vars) before writing outputs should help.

As it is now, the compiler cannot vectorize because a might alias b, so doing 4 multiplies in parallel and writing back 4 values might not be correct.
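
To make the aliasing concern concrete, here is a minimal sketch. The function names and the "read four, then write four" model are illustrative assumptions, not something taken from gcc itself. It compares the loop as written with a simplified stand-in for a 4-wide vectorized version:

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: the loop exactly as written in the question. */
void square_scalar(const float *a, float *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        b[i] = a[i] * a[i];
    }
}

/* Hypothetical model of a 4-wide vectorized loop: read four inputs
 * first, then write four outputs. */
void square_blocked4(const float *a, float *b, size_t n) {
    assert(n % 4 == 0);  /* sketch only: no scalar tail loop */
    for (size_t i = 0; i < n; i += 4) {
        float t0 = a[i], t1 = a[i + 1], t2 = a[i + 2], t3 = a[i + 3];
        b[i]     = t0 * t0;
        b[i + 1] = t1 * t1;
        b[i + 2] = t2 * t2;
        b[i + 3] = t3 * t3;
    }
}
```

If both functions are called on an overlapping pair such as buf and buf + 1, the scalar loop feeds each freshly written value into the next iteration, while the blocked version reads the old values. The two results differ, so without more information the compiler has to keep the scalar order.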

(Note that __restrict won't guarantee that the compiler vectorizes the loop, but as the code stands right now, it certainly cannot.)
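
As a sketch of that suggestion, the pow2 function from the question could be declared like this. The name pow2_restrict is made up for illustration, and __restrict is the compiler-extension spelling accepted by gcc (C99 also has the standard keyword restrict):

```c
#include <assert.h>
#include <stddef.h>

/* __restrict promises the compiler that a and b never overlap, which
 * removes the dependence the vectorizer could not rule out. Calling
 * this with overlapping arrays is undefined behavior. */
void pow2_restrict(const float *__restrict a, float *__restrict b, size_t n) {
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        b[i] = a[i] * a[i];
    }
}
```

Whether the loop is then actually vectorized still depends on the compiler version and flags, so it is worth checking the vectorizer report again. On newer gcc releases, -fopt-info-vec should take the place of -ftree-vectorizer-verbose.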

过潦 2024-12-04 20:13:15

This is not really an answer to your question, but rather a suggestion for how you might be able to avoid this issue entirely.

You mention that you're on OS X; there are already APIs on that platform that provide the operations you're looking at, without any need for auto-vectorization. Is there some reason that you aren't using them instead? Auto-vectorization is really cool, but it requires some work, and in general it doesn't produce results that are as good as using APIs that are already vectorized for you.

#include <string.h>
#include <Accelerate/Accelerate.h>

int main() {

    int n = 256;
    float a[256],
    b[256];

    // You can initialize the elements of a vector to a set value using memset_pattern:
    float threehalves = 1.5f;
    memset_pattern4(a, &threehalves, 4*n);

    // Since you have a fixed exponent for all of the base values, we will use
    // the vImage gamma functions.  If you wanted to have different exponents
    // for each input (i.e. from an array of exponents), you would use the vForce
    // vvpowf( ) function instead (also part of Accelerate).
    //
    // If you don't need full accuracy, replace kvImageGamma_UseGammaValue with
    // kvImageGamma_UseGammaValue_half_precision to get better performance.
    GammaFunction func = vImageCreateGammaFunction(2.3f, kvImageGamma_UseGammaValue, 0);
    vImage_Buffer src = { .data = a, .height = 1, .width = n, .rowBytes = 4*n };
    vImage_Buffer dst = { .data = b, .height = 1, .width = n, .rowBytes = 4*n };
    vImageGamma_PlanarF(&src, &dst, func, 0);
    vImageDestroyGammaFunction(func);

    // To simply square a instead, use the vDSP_vsq function.
    vDSP_vsq(a, 1, b, 1, n);

    return 0;
}

More generally, unless your algorithm is quite simple, auto-vectorization is unlikely to deliver great results. In my experience, the spectrum of vectorization techniques usually looks about like this:

better performance                                            worse performance
more effort                                                         less effort
+------+------+----------------------+----------------------------+-----------+
|      |      |                      |                            |           |
|      |  use vectorized APIs        |                   auto vectorization   |
|  skilled vector C                  |                              scalar code
hand written assembly       unskilled vector C
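
As one possible illustration of the "vector C" region of that spectrum, here is a minimal sketch for the integer-power case mentioned in the question. It assumes the GCC/Clang vector_size extension. The helper name powi_v4 and the square-and-multiply approach are illustrative choices, not something from Accelerate or from the original post:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four packed floats; vector_size is a GCC/Clang extension. */
typedef float v4f __attribute__((vector_size(16)));

/* Hypothetical helper: b[i] = a[i] raised to the non-negative integer
 * power p, using square-and-multiply on four lanes at a time. */
void powi_v4(const float *a, float *b, size_t n, unsigned p) {
    assert(n % 4 == 0);  /* sketch only: no scalar tail loop */
    for (size_t i = 0; i < n; i += 4) {
        v4f base;
        v4f result = {1.0f, 1.0f, 1.0f, 1.0f};
        memcpy(&base, a + i, sizeof base);       /* unaligned-safe load */
        for (unsigned e = p; e != 0; e >>= 1) {
            if (e & 1u) {
                result *= base;                  /* lane-wise multiply */
            }
            base *= base;                        /* lane-wise square */
        }
        memcpy(b + i, &result, sizeof result);   /* unaligned-safe store */
    }
}
```

Because the exponent is the same for every element, the branch on its bits is uniform across lanes, which keeps the vector code simple. A production version would need a tail loop for lengths that are not a multiple of four, and results can differ slightly from pow because of rounding in the repeated multiplications.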