Memory management in OpenCL

Published 2024-09-13 09:08:20


When I started programming in OpenCL I used the following approach for providing data to my kernels:

cl_mem buff = clCreateBuffer(cl_ctx, CL_MEM_READ_WRITE, object_size, NULL, NULL);
clEnqueueWriteBuffer(cl_queue, buff, CL_TRUE, 0, object_size, (void *) object, 0, NULL, NULL);

This obviously required me to partition my data into chunks, ensuring that each chunk would fit into the device memory. After performing the computations, I'd read the data back with clEnqueueReadBuffer(). However, at some point I realised I could just use the following line:

cl_mem buff = clCreateBuffer(cl_ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, object_size, (void*) object, NULL);

When doing this, the partitioning of the data became obsolete. And to my surprise, I experienced a great boost in performance. That is something I don't understand. From what I got, when using a host pointer, the device memory is working as a cache, but all the data still needs to be copied to it for processing and then copied back to main memory once finished. How come using an explicit copy ( clEnqueueRead/WriteBuffer ) is an order of magnitude slower, when in my mind it should be basically the same? Am I missing something?

Thanks.

Comments (2)

谁的年少不轻狂 2024-09-20 09:08:20


Yes, you're overlooking the CL_TRUE in the clEnqueueWriteBuffer call. It makes the write operation blocking, which stalls the CPU while the copy is made. With the host pointer, the OpenCL implementation can "optimize" the copy by making it asynchronous, so the overall performance is better.

Note that this depends on the CL implementation; there is no guarantee it will be faster, and it could equally be the same or slower.
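The asynchronous pattern this answer hints at can be sketched as follows. This is a minimal sketch, not the OP's actual code: it assumes an already-created context, command queue, and kernel (the names ctx, queue, kernel, and run are mine), and all error checking is omitted. CL_FALSE lets clEnqueueWriteBuffer return immediately, and the kernel is ordered after the copy through an event rather than by stalling the host thread:

```c
#include <CL/cl.h>

/* Hypothetical driver function; the caller is assumed to have set up
   the context, queue, kernel, and host data already. */
void run(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
         void *object, size_t object_size, size_t global_size) {
    cl_mem buff = clCreateBuffer(ctx, CL_MEM_READ_WRITE, object_size,
                                 NULL, NULL);

    cl_event copied;
    /* CL_FALSE: do not block the CPU while the transfer runs. */
    clEnqueueWriteBuffer(queue, buff, CL_FALSE, 0, object_size,
                         object, 0, NULL, &copied);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buff);
    /* The kernel waits on the copy via the event, not the host thread. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           1, &copied, NULL);

    /* Block only here, when the result is actually needed. */
    clEnqueueReadBuffer(queue, buff, CL_TRUE, 0, object_size,
                        object, 0, NULL, NULL);

    clReleaseEvent(copied);
    clReleaseMemObject(buff);
}
```

Whether the implementation actually overlaps the transfer with host work is, as the answer says, implementation-dependent.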

怪我鬧 2024-09-20 09:08:20


In some cases the CPU and GPU can share the same physical DRAM memory. For example, if the memory block satisfies CPU and GPU alignment rules then Intel interprets CL_MEM_USE_HOST_PTR as permission to share physical DRAM between CPU and GPU, so there is no actual copying of data. Obviously, that's very fast!

Here is a link that explains it:

https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics

PS: I know my reply comes far too late for the OP, but other readers may be interested.
