Does the global work size need to be a multiple of the work-group size in OpenCL?
Hello: In OpenCL, does the global work size (in each dimension) need to be a multiple of the work-group size (in each dimension)?
If so, is there a standard way of handling matrices whose dimensions are not a multiple of the work-group dimensions? I can think of two possibilities:
Dynamically set the work-group dimensions to a factor of the global work dimensions. (This would incur the overhead of finding a factor and might set the work group to a non-optimal size.)
Increase the global work dimensions to the nearest multiple of the work-group dimensions, keeping all input and output buffers the same but checking bounds in the kernel to avoid segfaulting, i.e. doing nothing in work items that fall outside the desired output. (This seems like the better way.)
Would the second way work? Is there a better way? (Or is none of this necessary because the work-group dimensions need not divide the global work dimensions?)
Thanks!
Comments (3)
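Thanks for the link, Chad. But actually, if you read on, the spec's description of clEnqueueNDRangeKernel lists CL_INVALID_WORK_GROUP_SIZE as the error returned when local_work_size is specified and the global work size is not evenly divisible by it.
So YES, the global work size must be a multiple of the local work size.
I also think rounding the global work size up to the nearest multiple and being careful about bounds should work. I'll post a comment when I get around to trying it.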
This seems to be an old post, but let me update it with some new information. Hopefully it helps someone else.
Answer: True up to OpenCL 2.0. Before CL 2.0, your global work size must be a multiple of the local work size; otherwise clEnqueueNDRangeKernel returns an error (CL_INVALID_WORK_GROUP_SIZE).
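To make the pre-2.0 rule concrete, here is a minimal host-side sketch in C of the divisibility condition. The helper name `ndrange_is_uniform` is hypothetical and not part of the OpenCL API; it only mirrors the rule stated above so a caller can detect the problem before enqueueing the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API call).
 * Returns 1 when every global size is an exact multiple of the matching
 * local size, and 0 as soon as one dimension is not. */
int ndrange_is_uniform(unsigned work_dim,
                       const size_t *global,
                       const size_t *local)
{
    for (unsigned d = 0; d < work_dim; ++d) {
        assert(local[d] > 0);           /* a zero local size is never valid */
        if (global[d] % local[d] != 0)
            return 0;                   /* this dimension leaves a remainder */
    }
    return 1;
}
```

For example, a 1000 x 768 range with 16 x 16 work groups fails the check in the first dimension, because 1000 is not a multiple of 16.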
But from CL 2.0 on, this is no longer required. You can use whatever global work size fits your application's dimensions. However, keep in mind that the hardware implementation might still work the "old" way, which means padding the global work size internally. Performance then depends heavily on the hardware architecture, and you may see quite different results on different hardware or platforms. You may also want your application to stay backward compatible with older platforms that only support CL up to version 1.2. So I think this feature was added in CL 2.0 mainly for programming convenience. For more controllable performance and for backward compatibility, I suggest you still use the method you mentioned:
Answer: You are absolutely right. This is the right way to handle such a case. Carefully choose the local work-group size (considering factors such as register usage, cache hit/miss rates, memory access patterns, and so on), then pad your global work size to a multiple of the local work size. Then you are good to go.
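The padding approach can be sketched as follows. This is only an illustration under stated assumptions: `round_up_to_multiple`, the kernel name `scale`, and its arguments `rows` and `cols` are made-up names, and the kernel body (doubling each element) is a placeholder for whatever the real matrix operation is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round `global` up to the next multiple of `local`. */
size_t round_up_to_multiple(size_t global, size_t local)
{
    assert(local > 0);
    size_t rem = global % local;
    return rem == 0 ? global : global + (local - rem);
}

/* Illustrative OpenCL C kernel source. The real matrix size is passed in
 * as rows/cols, and padded work-items return before touching memory. */
const char *scale_kernel_src =
    "__kernel void scale(__global float *out,\n"
    "                    __global const float *in,\n"
    "                    const int rows, const int cols)\n"
    "{\n"
    "    const int col = get_global_id(0);\n"
    "    const int row = get_global_id(1);\n"
    "    if (row >= rows || col >= cols)\n"
    "        return; /* padded work-item: nothing to do */\n"
    "    out[row * cols + col] = 2.0f * in[row * cols + col];\n"
    "}\n";
```

With 16 x 16 work groups, the host would pass `round_up_to_multiple(cols, 16)` and `round_up_to_multiple(rows, 16)` as the global sizes while the buffers keep their original size. One caveat: if the kernel calls barrier(), every work-item in the group has to reach it, so in that case guard the individual loads and stores instead of returning early.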
Another thing to consider: if your kernel does a lot of boundary checking, you can store the data in image objects instead of buffers. For image reads, out-of-range coordinates are handled by the hardware according to the sampler's addressing mode (for example, CLK_ADDRESS_CLAMP returns the border color), with almost no overhead in most implementations. So pad your global work size, store your data in image objects, and write the reads normally without explicit boundary checks. Writes to out-of-range image coordinates are still undefined, so keep a guard on those.
From what I saw in the standard, it doesn't have to be. I think I would handle it with a branch, but I don't know exactly what kind of matrix operation you are doing.
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf#page=131