
Does the global work size need to be a multiple of the work group size in OpenCL?

Posted on 2024-09-07 06:07:48

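Hello: in OpenCL, does the global work size (in each dimension) need to be a multiple of the work group size (in that dimension)?

If so, is there a standard way of handling matrices whose dimensions are not a multiple of the work group dimensions? I can think of two possibilities:

Dynamically set the work group dimensions to a factor of the global work dimensions. (This incurs the overhead of finding a factor and may leave the work group at a non-optimal size.)

Increase the global work dimensions to the nearest multiple of the work group dimensions, keeping all input and output buffers the same but checking bounds in the kernel to avoid segfaulting, i.e. do nothing in the work-items that fall outside the desired output. (This seems like the better way.)

Would the second way work? Is there a better way? (Or is it unnecessary because the work group dimensions need not divide the global work dimensions?)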
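To make the first possibility concrete, here is a minimal C sketch of the factor search. The helper name `largest_dividing_local_size` and its `max_local` parameter are made up for illustration; on a real device the upper limit would presumably come from a query such as `CL_DEVICE_MAX_WORK_GROUP_SIZE`.

```c
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL): returns the largest local size
 * that is <= max_local and evenly divides global_size.
 * Returns 0 for empty input; otherwise falls back to 1, which always divides. */
size_t largest_dividing_local_size(size_t global_size, size_t max_local)
{
    if (global_size == 0 || max_local == 0)
        return 0;
    if (max_local > global_size)
        max_local = global_size;
    /* Linear scan downwards: this is the "overhead of finding a factor". */
    for (size_t candidate = max_local; candidate > 1; --candidate) {
        if (global_size % candidate == 0)
            return candidate;
    }
    return 1;
}
```

With a limit of 64, a global size of 1000 yields 50, while a prime size such as 1021 yields 1, which shows how this approach can end up with a very poor work group size.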

Thanks!


梦亿 2024-09-14 06:07:48

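Thanks for the link, Chad. But actually, if you read on:

"If local_work_size is specified, the values specified in global_work_size[0], ... global_work_size[work_dim - 1] must be evenly divisible by the corresponding values specified in local_work_size[0], ... local_work_size[work_dim - 1]."

So yes, the global work size must be a multiple of the local work size in each dimension.

I also think that rounding the global work size up to the nearest multiple and being careful about bounds should work; I'll post a comment when I get around to trying it.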
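As a rough sketch of both points, the divisibility rule and the rounding can be written in plain C as below. The function names `is_uniform_ndrange` and `round_up_to_multiple` are hypothetical helpers, not OpenCL API calls; the resulting sizes would be what gets passed as `global_work_size` to `clEnqueueNDRangeKernel`.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if every global size is evenly divisible
 * by the corresponding local size, 0 otherwise. */
int is_uniform_ndrange(const size_t *global, const size_t *local,
                       unsigned work_dim)
{
    for (unsigned d = 0; d < work_dim; ++d) {
        if (local[d] == 0 || global[d] % local[d] != 0)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: rounds value up to the next multiple of `multiple`.
 * Assumes multiple > 0. */
size_t round_up_to_multiple(size_t value, size_t multiple)
{
    size_t remainder = value % multiple;
    return remainder == 0 ? value : value + (multiple - remainder);
}
```

For example, a 1000 x 600 problem with 16 x 16 work groups fails the check, but passes once each dimension is rounded up to 1008 x 608.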


盗心人 2024-09-14 06:07:48

This seems to be an old post, but let me update it with some new information. Hopefully it helps someone else.

"Does Global Work Size (Dimensions) Need to be Multiple of Work Group Size (Dimensions) in OpenCL?"

Answer: True before OpenCL 2.0. Before CL 2.0, your global work size must be a multiple of the local work size; otherwise clEnqueueNDRangeKernel returns an error (CL_INVALID_WORK_GROUP_SIZE).

From CL 2.0 on, this is no longer required: you can use whatever global work size fits your application's dimensions. However, keep in mind that the hardware implementation might still work the "old" way, i.e. pad the global work size internally. Performance therefore becomes highly dependent on the hardware architecture, and you may see quite different results on different hardware and platforms. You may also want your application to stay backward compatible with older platforms that only support CL up to version 1.2. So I think this feature was added in CL 2.0 mainly for ease of programming. For more controllable performance and for backward compatibility, I suggest you still use the method you mentioned:

"Increase the dimensions of the global work to be the nearest multiple of the work group dimensions, keeping all input and output buffers the same but checking bounds in the kernel to avoid segfaulting, i.e. do nothing on the work items out of bound of the desired output. (This seems like the better way.)"

Answer: you are absolutely right. This is the right way to handle such a case. Carefully choose the local work group size (considering factors such as register usage, cache hits and misses, memory access pattern and so on), then pad your global work size to a multiple of the local work size, and you are good to go.

Another thing to consider: if your kernel does a lot of boundary checking, you can store the data in an image object instead of a buffer. For image reads through a sampler, out-of-range coordinates are handled by the hardware according to the sampler's addressing mode (for example CLK_ADDRESS_CLAMP), with almost no overhead in most implementations. So you can pad the global work size, keep the input in an image object, and write the reads normally without explicit boundary checks. Out-of-range image writes are not covered by this (their behavior is undefined), so the output side still needs care.
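To illustrate the pad-and-guard pattern without needing an OpenCL runtime, here is a host-side emulation in plain C. It is only a sketch under assumptions: the nested loops stand in for the work-items of a padded 2D NDRange, `gid_col` and `gid_row` play the role of `get_global_id(0)` and `get_global_id(1)`, and the names `scale_work_item` and `run_padded_ndrange` as well as the scaling operation itself are invented for the example.

```c
#include <stddef.h>

/* Body of one hypothetical work-item: scales element (gid_row, gid_col)
 * of a rows x cols row-major matrix. */
static void scale_work_item(size_t gid_col, size_t gid_row,
                            const float *in, float *out,
                            size_t rows, size_t cols, float factor)
{
    /* Bounds guard: work-items that exist only because of padding
     * must not touch the buffers. */
    if (gid_row >= rows || gid_col >= cols)
        return;
    out[gid_row * cols + gid_col] = factor * in[gid_row * cols + gid_col];
}

/* Emulates a 2D NDRange whose global size is padded up to a multiple of
 * `local` in both dimensions (assumes local > 0).
 * Returns the number of work-items "launched", padding included. */
size_t run_padded_ndrange(const float *in, float *out,
                          size_t rows, size_t cols,
                          size_t local, float factor)
{
    size_t global_cols = (cols + local - 1) / local * local;
    size_t global_rows = (rows + local - 1) / local * local;

    for (size_t r = 0; r < global_rows; ++r)
        for (size_t c = 0; c < global_cols; ++c)
            scale_work_item(c, r, in, out, rows, cols, factor);

    return global_rows * global_cols;
}
```

For a 3 x 5 matrix with a local size of 4, the padded range is 4 x 8, so 32 work-items run but only 15 of them write; the buffers keep their original 3 x 5 size.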


我的鱼塘能养鲲 2024-09-14 06:07:48

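According to the standard it doesn't have to be, from what I saw. I think I would just handle it with a branch, but I don't know exactly what kind of matrix operation you are doing.

http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf#page=131

"global_work_size points to an array of work_dim unsigned values that describe the number of global work-items in work_dim dimensions that will execute the kernel function. The total number of global work-items is computed as global_work_size[0] * ... * global_work_size[work_dim - 1]. The values specified in global_work_size + corresponding values specified in global_work_offset cannot exceed the range given by the sizeof(size_t) for the device on which the kernel execution will be enqueued. The sizeof(size_t) for a device can be determined using CL_DEVICE_ADDRESS_BITS in table 4.3. If, for example, CL_DEVICE_ADDRESS_BITS = 32, i.e. the device uses a 32-bit address space, size_t is a 32-bit unsigned integer and global_work_size values must be in the range 1 .. 2^32 - 1. Values outside this range return a CL_OUT_OF_RESOURCES error."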
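The range rule in that passage can be checked on the host before enqueueing. The following is a minimal sketch, assuming the address width has already been obtained (for instance via `clGetDeviceInfo` with `CL_DEVICE_ADDRESS_BITS`); the function name `fits_device_size_t` is hypothetical, and the check is applied per dimension.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if global_size is at least 1 and
 * global_size + global_offset fits in an unsigned integer that is
 * address_bits wide, 0 otherwise. */
int fits_device_size_t(uint64_t global_size, uint64_t global_offset,
                       unsigned address_bits)
{
    if (global_size == 0 || address_bits == 0 || address_bits > 64)
        return 0;

    /* Largest value representable in address_bits bits. */
    uint64_t max_value = (address_bits == 64)
                             ? UINT64_MAX
                             : ((UINT64_C(1) << address_bits) - 1);

    if (global_offset > max_value)
        return 0;
    /* Written as a subtraction so the sum itself cannot overflow. */
    return global_size <= max_value - global_offset;
}
```

With 32 address bits, a global size of 2^32 - 1 with zero offset passes, while 2^32, or 2^32 - 1 with an offset of 1, does not.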