Fastest way to access device vector elements directly on the host
I refer you to the following page: http://code.google.com/p/thrust/wiki/QuickStartGuide#Vectors. See the second paragraph, which says:
"Also note that individual elements of a device_vector can be accessed using the standard bracket notation. However, because each of these accesses requires a call to cudaMemcpy, they should be used sparingly. We'll look at some more efficient techniques later."
I searched the whole document but could not find the more efficient techniques. Does anyone know the fastest way to do this, i.e. the fastest way to access a device vector or device pointer from the host?
The "more efficient techniques" the guide alludes to are the Thrust algorithms. Accessing (or copying across the PCI-E bus) millions of elements at once is more efficient than accessing them one at a time, because the fixed cost of each CPU/GPU communication is amortized over the whole batch.
There's no faster way to copy data from the GPU to the CPU than by calling cudaMemcpy, because it is the most primitive way for a CUDA programmer to implement the task.
If you have a device_vector that needs more processing, try to keep the data on the device and process it with Thrust algorithms or your own kernels. If you need to read only a few values from the device_vector, just access them directly with bracket notation. If you need to access more than a few values, copy the device_vector over to a host_vector and read the values from there.