Memory management in OpenCL

Published 2024-09-13 09:08:20


When I started programming in OpenCL I used the following approach for providing data to my kernels:

cl_mem buff = clCreateBuffer(cl_ctx, CL_MEM_READ_WRITE, object_size, NULL, NULL);
clEnqueueWriteBuffer(cl_queue, buff, CL_TRUE, 0, object_size, (void *) object, 0, NULL, NULL);

This obviously required me to partition my data into chunks, ensuring that each chunk would fit into the device memory. After performing the computations, I'd read the data back with clEnqueueReadBuffer(). However, at some point I realised I could just use the following line:

cl_mem buff = clCreateBuffer(cl_ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, object_size, (void*) object, NULL);

When doing this, the partitioning of the data became obsolete. And to my surprise, I experienced a great boost in performance. That is something I don't understand. From what I got, when using a host pointer, the device memory is working as a cache, but all the data still needs to be copied to it for processing and then copied back to main memory once finished. How come using an explicit copy ( clEnqueueRead/WriteBuffer ) is an order of magnitude slower, when in my mind it should be basically the same? Am I missing something?

Thanks.

Comments (2)

谁的年少不轻狂 2024-09-20 09:08:20


Yes, you're overlooking the CL_TRUE in the clEnqueueWriteBuffer call. It makes the write operation blocking, which stalls the CPU while the copy is made. With the host pointer, the OpenCL implementation can "optimize" the copy by making it asynchronous, so the overall performance is better.

Note that this depends on the CL implementation; there is no guarantee it will be faster, and it could equally be the same or slower.
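The asynchronous pattern this answer hints at can be sketched as follows. This is a minimal sketch, not the OP's actual code: it assumes an already-created context, command queue, and kernel (the names ctx, queue, kernel, and run are mine), and all error checking is omitted. CL_FALSE lets clEnqueueWriteBuffer return immediately, and the kernel is ordered after the copy through an event rather than by stalling the host thread:

```c
#include <CL/cl.h>

/* Hypothetical driver function; the caller is assumed to have set up
   the context, queue, kernel, and host data already. */
void run(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
         void *object, size_t object_size, size_t global_size) {
    cl_mem buff = clCreateBuffer(ctx, CL_MEM_READ_WRITE, object_size,
                                 NULL, NULL);

    cl_event copied;
    /* CL_FALSE: do not block the CPU while the transfer runs. */
    clEnqueueWriteBuffer(queue, buff, CL_FALSE, 0, object_size,
                         object, 0, NULL, &copied);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buff);
    /* The kernel waits on the copy via the event, not the host thread. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           1, &copied, NULL);

    /* Block only here, when the result is actually needed. */
    clEnqueueReadBuffer(queue, buff, CL_TRUE, 0, object_size,
                        object, 0, NULL, NULL);

    clReleaseEvent(copied);
    clReleaseMemObject(buff);
}
```

Whether the implementation actually overlaps the transfer with host work is, as the answer says, implementation-dependent.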

怪我鬧 2024-09-20 09:08:20


In some cases the CPU and GPU can share the same physical DRAM memory. For example, if the memory block satisfies CPU and GPU alignment rules then Intel interprets CL_MEM_USE_HOST_PTR as permission to share physical DRAM between CPU and GPU, so there is no actual copying of data. Obviously, that's very fast!

Here is a link that explains it:

https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics

PS: I know my reply comes far too late for the OP, but other readers may be interested.
