Does the global work size need to be a multiple of the work group size in OpenCL?

Posted on 2024-09-07 06:07:48

Hello: Does Global Work Size (Dimensions) Need to be Multiple of Work Group Size (Dimensions) in OpenCL?

If so, is there a standard way of handling matrices whose dimensions are not a multiple of the work group dimensions? I can think of two possibilities:

1. Dynamically set the work group dimensions to a factor of the global work dimensions. (This would incur the overhead of finding a factor, and might leave the work group at a non-optimal size.)

2. Increase the global work dimensions to the nearest multiple of the work group dimensions, keeping all input and output buffers the same but checking bounds in the kernel to avoid segfaulting, i.e. do nothing in work items that fall outside the desired output. (This seems like the better way.)

Would the second way work? Is there a better way? (Or is it not necessary because work group dimensions need not divide global work dimensions?)

Thanks!

Comments (3)

梦亿 2024-09-14 06:07:48

Thx for the link Chad. But actually, if you read on:

"If local_work_size is specified, the values specified in global_work_size[0], … global_work_size[work_dim - 1] must be evenly divisible by the corresponding values specified in local_work_size[0], … local_work_size[work_dim - 1]."

So YES, the global work size must be a multiple of the local work size.

I also think that rounding the global work size up to the nearest multiple and being careful about bounds should work. I'll post a comment when I get around to trying it.
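The rounding step can be sketched in plain C as follows. This is only an illustration: `round_up` and `padded_global_size` are hypothetical helper names, not part of the OpenCL API, and the sketch assumes a 2D range whose local size has already been chosen.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the nearest multiple of m (m must be non-zero).
   Hypothetical helper, not an OpenCL function. */
static size_t round_up(size_t n, size_t m)
{
    assert(m > 0);
    size_t rem = n % m;
    return rem ? n + (m - rem) : n;
}

/* Compute a padded 2D global size from the real data extent and the
   chosen local (work group) size, one dimension at a time. */
static void padded_global_size(const size_t extent[2],
                               const size_t local[2],
                               size_t global[2])
{
    for (int d = 0; d < 2; ++d)
        global[d] = round_up(extent[d], local[d]);
}
```

For a 1000 x 750 extent and a 16 x 16 work group this gives a global size of 1008 x 752; the extra work items are the ones the kernel has to skip.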

盗心人 2024-09-14 06:07:48

This seems to be an old post, but let me update this post with some new information. Hopefully, it could help someone else.

"Does Global Work Size (Dimensions) Need to be Multiple of Work Group Size (Dimensions) in OpenCL?"

Answer: True before OpenCL 2.0. Before CL2.0, when you pass an explicit local work size, your global work size must be a multiple of it in every dimension; otherwise clEnqueueNDRangeKernel fails with an error (CL_INVALID_WORK_GROUP_SIZE).
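If you want to catch this on the host before enqueueing, a check along these lines mirrors the pre-2.0 divisibility rule. It is a sketch only: `ndrange_is_uniform` is a made-up helper name, and it simply tests each dimension rather than calling into OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every global size is evenly divisible by the matching
   local size, 0 otherwise. Hypothetical helper, not an OpenCL call. */
static int ndrange_is_uniform(unsigned work_dim,
                              const size_t *global,
                              const size_t *local)
{
    assert(global != NULL && local != NULL);
    for (unsigned d = 0; d < work_dim; ++d) {
        /* A zero local size can never divide anything, so reject it. */
        if (local[d] == 0 || global[d] % local[d] != 0)
            return 0;
    }
    return 1;
}
```

A 1000 x 750 range with a 16 x 16 work group fails this check, while the padded 1008 x 752 range passes.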

But from CL2.0 on, this is no longer required: you can use whatever global work size fits your application's dimensions. However, remember that the hardware implementation might still work the "old" way internally, i.e. by padding the global work size, so performance becomes highly dependent on the hardware architecture, and you may see quite different performance on different hardware/platforms. You may also want your application to stay backward compatible with older platforms that only support CL up to version 1.2. So I think this feature was added in CL2.0 mainly for ease of programming; for more controllable performance and backward compatibility, I suggest you still use the method you mentioned:

"Increase the dimensions of the global work to be the nearest multiple of the work group dimensions, keeping all input and output buffers the same but checking bounds in the kernel to avoid segfaulting, i.e. do nothing on the work items out of bound of the desired output. (This seems like the better way.)"

Answer: You are absolutely right; this is the right way to handle such a case. Carefully choose the local work group size (considering factors such as register usage, cache hit/miss behavior, memory access patterns and so on), then pad your global work size to a multiple of the local work size, and you are good to go.
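To make the bounds check concrete, here is a minimal sketch. The kernel (`scale`, with its `rows`/`cols` arguments) is an invented example shown as a host-side source string, and `emulate_scale` is a plain-C stand-in that walks the padded range the way the work items would, so the guard can be tried without an OpenCL device. None of these names come from the original question.

```c
#include <assert.h>
#include <stddef.h>

/* OpenCL C source for an illustrative guarded kernel. Work items that
   fall in the padding return before touching the buffer. */
static const char *scale_kernel_src =
    "__kernel void scale(__global float *data,\n"
    "                    const int rows, const int cols)\n"
    "{\n"
    "    const int col = get_global_id(0);\n"
    "    const int row = get_global_id(1);\n"
    "    if (row >= rows || col >= cols)\n"
    "        return; /* padded work item: nothing to do */\n"
    "    data[row * cols + col] *= 2.0f;\n"
    "}\n";

/* Plain-C emulation of the same guard over a padded global range.
   Returns how many work items actually touched the buffer. */
static size_t emulate_scale(float *data, int rows, int cols,
                            size_t global_x, size_t global_y)
{
    size_t active = 0;
    for (size_t gy = 0; gy < global_y; ++gy) {
        for (size_t gx = 0; gx < global_x; ++gx) {
            if (gy >= (size_t)rows || gx >= (size_t)cols)
                continue; /* same test as the kernel's early return */
            data[gy * (size_t)cols + gx] *= 2.0f;
            ++active;
        }
    }
    return active;
}
```

One caveat: if the kernel uses barrier(), an early return like this is not safe, because every work item in a work group has to reach the barrier. In that case, keep the padded work items alive and guard only the memory accesses.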

Another thing to consider, if your kernel does a lot of boundary checking, is storing the data in an image object instead of a buffer. For image reads, out-of-range coordinates are handled by the hardware according to the sampler's addressing mode, with almost no overhead in most implementations. So you can pad your global work size, store your input data in an image object, and write the read side of your code normally without explicit boundary checks. Note that this covers reads only; writing to an image at out-of-range coordinates is undefined, so the output side still needs a guard.

我的鱼塘能养鲲 2024-09-14 06:07:48

From what I saw in the standard, it doesn't have to be. I think I would handle it with a branch, but I don't know exactly what kind of matrix operation you are doing.

http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf#page=131

"global_work_size points to an array of work_dim unsigned values that describe the number of global work-items in work_dim dimensions that will execute the kernel function. The total number of global work-items is computed as global_work_size[0] * ... * global_work_size[work_dim - 1].

The values specified in global_work_size + corresponding values specified in global_work_offset cannot exceed the range given by the sizeof(size_t) for the device on which the kernel execution will be enqueued. The sizeof(size_t) for a device can be determined using CL_DEVICE_ADDRESS_BITS in table 4.3. If, for example, CL_DEVICE_ADDRESS_BITS = 32, i.e. the device uses a 32-bit address space, size_t is a 32-bit unsigned integer and global_work_size values must be in the range 1 .. 2^32 - 1. Values outside this range return a CL_OUT_OF_RESOURCES error."
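The range rule in that quote can be sketched as a host-side check. This is only an illustration of the quoted text: `fits_device_size_t` is a hypothetical helper, and the `address_bits` argument stands for the value you would get by querying CL_DEVICE_ADDRESS_BITS for the device.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if global_size is at least 1 and global_size + global_offset
   fits in an unsigned integer of address_bits bits, 0 otherwise.
   Hypothetical helper, not an OpenCL call. */
static int fits_device_size_t(uint64_t global_size,
                              uint64_t global_offset,
                              unsigned address_bits)
{
    assert(address_bits == 32 || address_bits == 64);
    const uint64_t max = (address_bits == 64)
                             ? UINT64_MAX
                             : (((uint64_t)1 << address_bits) - 1);
    if (global_size == 0 || global_size > max)
        return 0;
    /* Compare without computing size + offset, which could wrap around. */
    return global_offset <= max - global_size;
}
```

With 32 address bits, a global size of 2^32 - 1 with zero offset passes, while 2^32, or 2^32 - 1 with an offset of 1, does not.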
