Why does the Nvidia SDK OpenCL vector addition example use asynchronous writes?
The vector addition example has this code:
// Asynchronous write of data to GPU device
ciErr1 = clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_FALSE, 0, sizeof(cl_float) * szGlobalWorkSize, srcA, 0, NULL, NULL);
ciErr1 |= clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcB, CL_FALSE, 0, sizeof(cl_float) * szGlobalWorkSize, srcB, 0, NULL, NULL);
shrLog("clEnqueueWriteBuffer (SrcA and SrcB)...\n");
if (ciErr1 != CL_SUCCESS)
{
    shrLog("Error in clEnqueueWriteBuffer, Line %u in file %s !!!\n\n", __LINE__, __FILE__);
    Cleanup(EXIT_FAILURE);
}
// Launch kernel
ciErr1 = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 1, NULL, &szGlobalWorkSize, &szLocalWorkSize, 0, NULL, NULL);
shrLog("clEnqueueNDRangeKernel (VectorAdd)...\n");
if (ciErr1 != CL_SUCCESS)
It launches the kernel right afterwards. How does this not cause problems? Nothing here seems to guarantee that the graphics memory buffers have been fully written by the time the kernel launches, right?
While the writes are asynchronous from the host's point of view, they aren't necessarily asynchronous from the device's point of view. I'd assume that the command queue is created without CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, so it's an in-order command queue.
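The OpenCL specification describes in-order execution as follows: commands are launched in the order they appear in the command queue and complete in order, so a prior command on the queue completes before the following command begins.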
Therefore the writes should complete before the kernel is executed on the device.
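To make the ordering argument concrete, here is a toy model in plain C. It is only a sketch of the semantics, not the OpenCL runtime or the SDK sample: `InOrderQueue`, `enqueue`, `finish`, and `run_demo` are hypothetical names made up for this illustration. Enqueueing only records a command (the host returns immediately, like a non-blocking write), and the "device" later runs the commands strictly in FIFO order.

```c
#include <assert.h>
#include <stddef.h>

#define N 4          /* elements per vector */
#define MAX_CMDS 8   /* capacity of the toy queue */

typedef enum { CMD_WRITE, CMD_ADD } CmdType;

typedef struct {
    CmdType type;
    float *dst;       /* "device" buffer that receives the result */
    const float *a;   /* CMD_WRITE: host source; CMD_ADD: first operand */
    const float *b;   /* CMD_ADD only: second operand */
} Command;

/* Hypothetical stand-in for an in-order command queue. */
typedef struct {
    Command cmds[MAX_CMDS];
    size_t count;
} InOrderQueue;

/* Records the command and returns at once; no data is touched yet.
 * This models the host-side view of a non-blocking enqueue. */
static void enqueue(InOrderQueue *q, Command c)
{
    assert(q->count < MAX_CMDS);
    q->cmds[q->count++] = c;
}

/* Models the device draining the queue: each command runs to
 * completion before the next one starts (in-order execution). */
static void finish(InOrderQueue *q)
{
    for (size_t i = 0; i < q->count; ++i) {
        const Command *c = &q->cmds[i];
        for (size_t j = 0; j < N; ++j)
            c->dst[j] = (c->type == CMD_WRITE) ? c->a[j] : c->a[j] + c->b[j];
    }
    q->count = 0;
}

/* Enqueues two writes and an add, then drains the queue.
 * Returns how many commands were still pending before the drain. */
size_t run_demo(float out[N])
{
    const float srcA[N] = {1, 2, 3, 4};
    const float srcB[N] = {10, 20, 30, 40};
    float devA[N] = {0}, devB[N] = {0};
    InOrderQueue q = { .count = 0 };

    enqueue(&q, (Command){CMD_WRITE, devA, srcA, NULL});
    enqueue(&q, (Command){CMD_WRITE, devB, srcB, NULL});
    enqueue(&q, (Command){CMD_ADD, out, devA, devB});

    size_t pending = q.count;  /* nothing has executed yet */
    finish(&q);                /* writes complete before the add runs */
    return pending;
}
```

Calling `run_demo(out)` from a `main` function leaves the element-wise sums in `out`, even though all three commands were queued before any of them ran. In this model the add cannot see stale buffers because it sits behind both writes in the same queue. With an out-of-order queue that guarantee would no longer come from queue position alone.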