I am trying to write MergeSort in OpenCL (I know, BitonicSort is faster, but I want to compare them), and I have come across a strange problem:
If I set the global size to 1 << 24 and the local size to 512, the kernel is not executed, and neither are any kernels enqueued after it. However, I get no error of any kind, either when enqueuing the kernel or when waiting for the queue to finish. The kernel simply does not run. ComputeProfiler confirms this: no kernel appears. With a global size of 1 << 23, however, the algorithm works well. With a local size of 256, the smallest failing global size is 1 << 23.
Why does this happen? I thought there could be at least 65535 work-groups (according to the NVIDIA Programming Guide). Rounded down to the nearest power of two, that is 32768 == 1 << 15, and with a local size of 512 == 1 << 9 this means that a global size of 1 << 24 should still be OK. Moreover, I can execute another kernel with this same global and local size.
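The work-group arithmetic above can be checked with a small host-side sketch. The helper name work_group_count is hypothetical and is not part of OpenCL; it only divides the global size by the local size. Under that arithmetic, both failing configurations (1 << 24 with 512, and 1 << 23 with 256) come out at 32768 work-groups, and the working one (1 << 23 with 512) at 16384, all below the 65535 figure quoted above.

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API): number of work-groups in a
 * 1-D NDRange. Returns 0 if local_size is 0 or does not evenly divide
 * global_size, since OpenCL 1.x requires the global size to be a
 * multiple of the local size. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}
```

For example, work_group_count((size_t)1 << 24, 512) and work_group_count((size_t)1 << 23, 256) give the same value, which suggests the failure may be tied to the number of work-groups rather than to the global size itself. That is only an observation from the numbers in this post, not a confirmed cause.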
Above all, there is no error, so I cannot detect that this has happened. I will probably have to use a workaround (looping manually over the large set in smaller batches of work-groups), but I want to understand the problem.
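A minimal sketch of how such a workaround might be planned on the host, assuming the kernel's work-items within one pass are independent so that the range can be split. The names ndrange_chunk and plan_chunks are hypothetical, and the code only computes offsets and sizes; it does not call OpenCL.

```c
#include <stddef.h>

/* Hypothetical type: one slice of a larger 1-D NDRange. */
typedef struct {
    size_t offset; /* first global work-item index in this slice */
    size_t size;   /* number of work-items in this slice */
} ndrange_chunk;

/* Hypothetical helper: splits global_size into slices of at most
 * max_chunk work-items. Writes at most capacity entries to out and
 * returns the number of slices needed (which may exceed capacity).
 * Returns 0 if the sizes are not multiples of local_size, so every
 * slice stays a whole number of work-groups. */
size_t plan_chunks(size_t global_size, size_t local_size,
                   size_t max_chunk, ndrange_chunk *out, size_t capacity)
{
    if (local_size == 0 || max_chunk == 0 ||
        global_size % local_size != 0 || max_chunk % local_size != 0)
        return 0;

    size_t off = 0;
    size_t n = 0;
    while (off < global_size) {
        size_t remaining = global_size - off;
        size_t size = remaining < max_chunk ? remaining : max_chunk;
        if (n < capacity) {
            out[n].offset = off;
            out[n].size = size;
        }
        off += size;
        n++;
    }
    return n;
}
```

Each slice would then go into its own clEnqueueNDRangeKernel call with size as the global work size. The offset could be passed as global_work_offset where the implementation supports OpenCL 1.1 (in OpenCL 1.0 that argument must be NULL), or as an extra kernel argument added to get_global_id(0).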
Thanks for any suggestions.
PS: I am using an NVIDIA GTX 580 on a Linux machine with driver version 260.19.26.