OpenMP device offload: reduction into an existing location in device memory

How do I tell OpenMP device offload to use an existing location in device memory for a reduction? I want to avoid data movement to/from the device; the results will only be accessed on the device.

Here's my code:

void reduce(const double *mi, const double *xi, const double *yi,                                                                                
    double *mo, double *xo, double *yo, long n)
{
    #pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
    for (long i = 0; i < n; ++i)
    {   
        mo[0] += mi[i];
        xo[0] += mi[i]*xi[i];
        yo[0] += mi[i]*yi[i];
    }   

    #pragma omp target is_device_ptr(mo,xo,yo)
    {   
    xo[0] /= mo[0];
    yo[0] /= mo[0];
    }   
}

With this code and clang++ 15 targeting NVIDIA PTX, I get the following error:

test.cpp:6:109: error: reduction variable cannot be in a is_device_ptr clause in '#pragma omp target teams distribute parallel for' directive
    #pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
                                                                                                            ^
test.cpp:6:67: note: defined as reduction
    #pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
                                                                  ^
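
For context, the is_device_ptr clauses imply that mi, xi, yi, mo, xo and yo already point to device memory. Below is a minimal sketch of the kind of setup that assumes; the caller run_reduce and the omp_target_alloc-based allocation are illustrative assumptions, not part of the original post.

#include <omp.h>

void reduce(const double *mi, const double *xi, const double *yi,
    double *mo, double *xo, double *yo, long n);

// Hypothetical caller: allocate device-resident buffers so the pointers
// passed to reduce() are valid device pointers for is_device_ptr.
void run_reduce(long n)
{
    const int dev = omp_get_default_device();

    double *mi = (double *) omp_target_alloc(n * sizeof(double), dev);
    double *xi = (double *) omp_target_alloc(n * sizeof(double), dev);
    double *yi = (double *) omp_target_alloc(n * sizeof(double), dev);
    double *mo = (double *) omp_target_alloc(sizeof(double), dev);
    double *xo = (double *) omp_target_alloc(sizeof(double), dev);
    double *yo = (double *) omp_target_alloc(sizeof(double), dev);

    // ... fill mi, xi, yi (and zero mo, xo, yo) on the device, then:
    reduce(mi, xi, yi, mo, xo, yo, n);

    // Results stay in device memory; free the buffers when done.
    omp_target_free(mi, dev); omp_target_free(xi, dev); omp_target_free(yi, dev);
    omp_target_free(mo, dev); omp_target_free(xo, dev); omp_target_free(yo, dev);
}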

素罗衫 2025-02-18 08:35:25

You cannot use array subscripts in a reduction clause; that's non-conforming code. Please try something along these lines:

#include <stdio.h>

int main(int argc, char * argv[]) {
    double sum = 0;

    #pragma omp target data map(tofrom:sum)
    {
        for (int t = 0; t < 10; t++) {
            #pragma omp target teams distribute parallel for map(tofrom:sum) reduction(+:sum)
            for (int j = 0; j < 10000; j++) {
                sum += 1;
            }
        }
    }

    printf("sum=%lf\n", sum);
    return 0;
}

With the target data construct you can allocate a buffer for the reduction variable on the GPU. The target construct's reduction clause will then reduce the value into that buffered variable, and the variable will only be transferred back from the GPU at the closing curly brace of the target data construct.
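
As a side note, if the host needs the reduced value before the target data region ends, an explicit target update directive can pull it back early. A minimal sketch of that variation, building on the same sum example, might look like this:

#include <stdio.h>

int main(int argc, char * argv[]) {
    double sum = 0;

    #pragma omp target data map(tofrom:sum)
    {
        #pragma omp target teams distribute parallel for map(tofrom:sum) reduction(+:sum)
        for (int j = 0; j < 10000; j++) {
            sum += 1;
        }

        // Copy the device value back explicitly; without this, the transfer
        // happens only at the closing brace of the target data region.
        #pragma omp target update from(sum)
        printf("sum so far = %lf\n", sum);
    }

    printf("sum=%lf\n", sum);
    return 0;
}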

诗化ㄋ丶相逢 2025-02-18 08:35:25

It is possible to do this all on the device. A couple of key pieces of info were required:

  1. The map clause's map-type controls the copies to/from the device and can disable unnecessary ones. The alloc map-type disables the copy in both directions.
  2. Variables have only one instance on the device. map clauses for variables already mapped by an enclosing target data or target enter data construct do not, by default, result in copies to/from the device.

With that the solution is as follows:

// --------------------------------------------------------------------------
void reduce(const double *mi, const double *xi, const double *yi,
    double *mo, double *xo, double *yo, long n)
{
    // Device-resident scalars for the reduction; the alloc map-type means
    // they are never copied to or from the host.
    double m, x, y;
    #pragma omp target enter data map(alloc: m,x,y)

    // Initialize the accumulators on the device.
    #pragma omp target map(alloc: m,x,y)
    {
    m = 0.;
    x = 0.;
    y = 0.;
    }

    // Reduce into the device-resident scalars.
    #pragma omp target teams distribute parallel for reduction(+: m,x,y), \
        is_device_ptr(mi,xi,yi), map(alloc: m,x,y)
    for (long i = 0; i < n; ++i)
    {
        m += mi[i];
        x += mi[i]*xi[i];
        y += mi[i]*yi[i];
    }

    // Store the results through the caller's device pointers.
    #pragma omp target is_device_ptr(mo,xo,yo), map(alloc: m,x,y)
    {
    mo[0] = m;
    xo[0] = x/m;
    yo[0] = y/m;
    }

    #pragma omp target exit data map(release: m,x,y)
}

趁微风不噪 2025-02-18 08:35:25

BEWARE this is a tentative, unverified answer, since I don't have the target, and Compiler Explorer doesn't seem to have a gcc with offloading enabled. Hence this is untested.

However, you can clearly try this for yourself!

I suggest splitting the directives, and adding explicit scalar locals for the reduction.

So your code would look something like this

void reduce(const double *mi, const double *xi, const double *yi,                                                                                
    double *mo, double *xo, double *yo, long n)
{
    #pragma omp target is_device_ptr(mi,xi,yi,mo,xo,yo)
    {
        double mTotal = 0.0;
        double xTotal = 0.0;
        double yTotal = 0.0;

        #pragma omp teams distribute parallel for reduction(+: mTotal, xTotal, yTotal) 
        for (long i = 0; i < n; ++i)
        {   
            mTotal += mi[i];
            xTotal += mi[i]*xi[i];
            yTotal += mi[i]*yi[i];
        }   
   
        mo[0] = mTotal; 
        xo[0] = xTotal/mTotal;
        yo[0] = yTotal/mTotal;
    }   
}

That compiles OK for the host, but, as above, YOUR MILEAGE MAY VARY
