Strange behavior of the OpenCL atomic add operation
For a project, I had to dive into OpenCL. Things are going fairly well, except that I now need atomic operations. I'm running the OpenCL code on an Nvidia GPU with the latest drivers. Querying CL_DEVICE_VERSION with clGetDeviceInfo() returns OpenCL 1.0 CUDA, so I assume I have to refer to the OpenCL 1.0 specification.

I started using an atom_add operation in my kernel on a __global int* vnumber buffer: atom_add(&vnumber[0], 1);. This gave me clearly wrong results. As an additional check, I moved the add instruction to the beginning of the kernel, so that every thread executes it. When the kernel is launched with 512 x 512 threads, the content of vnumber[0] is 524288, which is exactly 2 x 512 x 512, twice the value I should get. The funny thing is that when I change the add operation to atom_add(&vnumber[0], 2);, the returned value is 65536, again twice what I should get.

Has anyone experienced something similar? Am I missing something very basic? I have checked the data types and they seem OK (I'm using an int* buffer and allocating it with sizeof(cl_int)).
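
To make the arithmetic in the question explicit, here is a small host-side sketch in plain C. It assumes the host zeroed the buffer before the launch and that every work-item performs the add exactly once; neither point is stated in the question. The helper names expected_counter and overcount_factor are made up for illustration.

```c
#include <assert.h>

/* Value vnumber[0] should hold after a single launch, assuming the host
   zeroed the buffer first and each work-item adds `increment` exactly once. */
long expected_counter(long global_x, long global_y, long increment)
{
    assert(global_x > 0 && global_y > 0);
    return global_x * global_y * increment;
}

/* How many times larger the value read back is than the expected one
   (integer division; 1 means the counter is correct). */
long overcount_factor(long observed, long expected)
{
    assert(expected != 0);
    return observed / expected;
}
```

For the 512 x 512 launch with an increment of 1, expected_counter gives 262144, so the reported 524288 corresponds to an overcount factor of 2.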
Comments (1)
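You are using atom_add, which is not a core built-in in OpenCL 1.0. It comes from optional extensions, with separate ones for local memory (cl_khr_local_int32_base_atomics) and global memory (cl_khr_global_int32_base_atomics). You are passing it global memory, so the global variant is the one your kernel would need. Instead, try OpenCL 1.1's atomic_add, which is a core built-in and works with global memory.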
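
A minimal sketch of what the suggested change could look like on the host side, with the kernel source held as a C string in the form one might pass to clCreateProgramWithSource. The kernel name count_threads, the enum, and the helper counter_kernel_source are hypothetical; the thread does not show the real kernel. The second variant keeps atom_add and enables cl_khr_global_int32_base_atomics, the OpenCL 1.0 extension that covers 32-bit atomics on __global memory. Whether the compiler accepts atomic_add on a device that reports OpenCL 1.0 is not confirmed in the thread.

```c
#include <assert.h>

/* OpenCL 1.1 variant: atomic_add is a core built-in for 32-bit integers.
   The kernel name count_threads is hypothetical. */
const char *const kernel_src_cl11 =
    "__kernel void count_threads(__global int *vnumber)\n"
    "{\n"
    "    /* First statement, so every work-item runs it once. */\n"
    "    atomic_add(&vnumber[0], 1);\n"
    "}\n";

/* OpenCL 1.0 variant: atom_add on __global memory belongs to the
   cl_khr_global_int32_base_atomics extension, enabled here by pragma. */
const char *const kernel_src_cl10 =
    "#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable\n"
    "__kernel void count_threads(__global int *vnumber)\n"
    "{\n"
    "    atom_add(&vnumber[0], 1);\n"
    "}\n";

enum atomic_variant { ATOMIC_CL10_EXT = 0, ATOMIC_CL11_CORE = 1 };

/* Return the kernel source for the requested variant. */
const char *counter_kernel_source(enum atomic_variant v)
{
    assert(v == ATOMIC_CL10_EXT || v == ATOMIC_CL11_CORE);
    return (v == ATOMIC_CL11_CORE) ? kernel_src_cl11 : kernel_src_cl10;
}
```

Either string would then be built like any other program source. Zeroing vnumber on the host before the launch and reading it back only after the kernel has finished makes the result comparable with 512 x 512.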