OpenMP device offload reduction to an existing device memory location
How do I tell OpenMP device offload to use an existing location in device memory for a reduction? I want to avoid data movement to/from device. Results will only be accessed on the device.
Here's my code:
void reduce(const double *mi, const double *xi, const double *yi,
            double *mo, double *xo, double *yo, long n)
{
    #pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
    for (long i = 0; i < n; ++i)
    {
        mo[0] += mi[i];
        xo[0] += mi[i]*xi[i];
        yo[0] += mi[i]*yi[i];
    }
    #pragma omp target is_device_ptr(mo,xo,yo)
    {
        xo[0] /= mo[0];
        yo[0] /= mo[0];
    }
}
With this code and clang++ 15 targeting NVIDIA PTX, I get the error:
test.cpp:6:109: error: reduction variable cannot be in a is_device_ptr clause in '#pragma omp target teams distribute parallel for' directive
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
^
test.cpp:6:67: note: defined as reduction
#pragma omp target teams distribute parallel for reduction(+: mo[0],xo[0],yo[0]) is_device_ptr(mi,xi,yi,mo,xo,yo)
^
Comments (3)
You cannot use array subscripts in a reduction clause; that is non-conforming code. Please try something along the following lines. With the target data construct you can allocate a buffer for the reduction variable on the GPU. The target construct's reduction clause will then reduce the value into that buffered variable, and the variable is only transferred back from the GPU at the closing curly brace of the target data construct.
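The target data approach described in this answer might look like the following minimal sketch. It is not the answerer's own code: the function name `sum_indices` and the variable `sum` are hypothetical, and the explicit `map(tofrom: sum)` on the inner construct is an added assumption so that the behavior does not depend on the implicit mapping rules of a particular OpenMP version.

```cpp
// Hypothetical helper (not from the original answer): sums 0..n-1 on the
// device using a scalar reduction variable instead of an array subscript.
double sum_indices(long n)
{
    double sum = 0.0;

    // Map `sum` once for the whole region: copied to the device on entry,
    // copied back to the host at the closing brace of target data.
    #pragma omp target data map(tofrom: sum)
    {
        // `sum` is already present on the device, so this map clause
        // does not trigger another transfer. A scalar is a conforming
        // reduction list item.
        #pragma omp target teams distribute parallel for \
            map(tofrom: sum) reduction(+: sum)
        for (long i = 0; i < n; ++i)
            sum += static_cast<double>(i);
    }
    return sum;
}
```

For the question's code, three scalars would take the place of `mo[0]`, `xo[0]` and `yo[0]` in the reduction clause. Note that this sketch still copies the result back to the host, which the question wants to avoid.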
It is possible to do this all on the device. A couple of key pieces of info were required:
1. The map clause's map-type is used to optimize the copies to/from the device and disable unnecessary copies. The alloc map-type disables both copies to and from the device.
2. map clauses for variables already mapped by an enclosing target data or target enter data by default do not result in copies to/from the device.
With that, the solution is as follows:
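One way those two points could combine, as a minimal sketch rather than the answerer's own code: the temporaries `mt`, `xt` and `yt` are hypothetical names, and the sketch assumes the outputs are overwritten rather than accumulated into. Because `alloc` leaves the device copies uninitialized, they are zeroed in a small target region before the reduction.

```cpp
// Sketch only. mi, xi, yi, mo, xo, yo are assumed to be device pointers
// (or plain host pointers when running without an offload device).
void reduce(const double *mi, const double *xi, const double *yi,
            double *mo, double *xo, double *yo, long n)
{
    double mt, xt, yt;  // hypothetical temporaries for the reduction

    // alloc: create device copies with no host<->device transfers.
    #pragma omp target data map(alloc: mt, xt, yt)
    {
        // Zero the device copies on the device itself.
        #pragma omp target map(alloc: mt, xt, yt)
        {
            mt = 0.0;
            xt = 0.0;
            yt = 0.0;
        }

        // Reduce into the already-mapped scalars; no copies happen here.
        #pragma omp target teams distribute parallel for \
            reduction(+: mt, xt, yt) map(alloc: mt, xt, yt) \
            is_device_ptr(mi, xi, yi)
        for (long i = 0; i < n; ++i)
        {
            mt += mi[i];
            xt += mi[i] * xi[i];
            yt += mi[i] * yi[i];
        }

        // Store the results in the existing device memory locations.
        #pragma omp target map(alloc: mt, xt, yt) is_device_ptr(mo, xo, yo)
        {
            mo[0] = mt;
            xo[0] = xt / mt;
            yo[0] = yt / mt;
        }
    }
}
```

The reduction variables never appear in `is_device_ptr`, which sidesteps the clang error from the question, and the host values of `mt`, `xt` and `yt` are never read or written.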
Beware: this is a tentative, unverified answer, since I don't have the target hardware, and Compiler Explorer doesn't seem to have a gcc with offload enabled. Hence this is untested.
However, you can clearly try this for yourself!
I suggest splitting the directives and adding explicit scalar locals for the reduction.
A version of your code restructured that way compiles OK for the host, but, as above, your mileage may vary.
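The suggestion to split the directives and reduce into scalar locals could be sketched as below. This is an illustration under assumptions, not the answerer's code: `reduce_split`, `mTotal`, `xTotal` and `yTotal` are hypothetical names. The sketch uses `parallel for` inside `target` rather than `teams distribute parallel for`, on the understanding that a `teams` construct nested in `target` may not share the region with other statements or declarations; the cost is that the loop runs within a single team.

```cpp
// Sketch only, untested on a real offload device.
void reduce_split(const double *mi, const double *xi, const double *yi,
                  double *mo, double *xo, double *yo, long n)
{
    #pragma omp target is_device_ptr(mi, xi, yi, mo, xo, yo)
    {
        // Scalar locals declared inside the target region live on the
        // device, so no mapping or transfer is involved.
        double mTotal = 0.0;
        double xTotal = 0.0;
        double yTotal = 0.0;

        // Reduce into the locals; they are valid reduction list items.
        #pragma omp parallel for reduction(+: mTotal, xTotal, yTotal)
        for (long i = 0; i < n; ++i)
        {
            mTotal += mi[i];
            xTotal += mi[i] * xi[i];
            yTotal += mi[i] * yi[i];
        }

        // Write the results to the existing device locations.
        mo[0] = mTotal;
        xo[0] = xTotal / mTotal;
        yo[0] = yTotal / mTotal;
    }
}
```

As with the question's second target region, the outputs here are overwritten rather than accumulated into, which is an assumption about the intended behavior.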