Copying an integer from GPU to CPU
I need to copy a single boolean or an integer value from the device to the host after every kernel call (I am calling the same kernel in a for loop). That is, after every kernel call, I need to send an integer or a boolean back to the host. What is the best way to do this?
Should I write the value directly to RAM? Or should I use cudaMemcpy()? Or is there any other way to do this? Would copying just 1 integer after every kernel launch slow down my program?
4 Answers
Let me first answer your last question:
Would copying just 1 integer after every kernel launch slow down my program?
A bit, yes. Issuing the command, waiting for the GPU to respond, and so on all cost time. The amount of data (1 int vs. 100 ints) probably doesn't matter much in this case. However, you can still achieve thousands of memory transfers per second. Most likely, your kernel will be slower than this single memory transfer (otherwise, it would probably be better to do the whole task on the CPU).
What is the best way to do this?
Well, I would suggest simply trying it yourself. As you said, you can either use mapped pinned memory and have your kernel store the value directly to RAM, or use cudaMemcpy. The first may be better if your kernel still has some work to do after sending the integer back; in that case, the latency of sending it to the host can be hidden by the kernel's execution.
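A minimal sketch of the mapped-pinned-memory variant, assuming a hypothetical kernel myKernel that raises a one-int flag (error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: does its work, then writes a flag that lives in
// mapped host memory, so the store lands directly in host RAM.
__global__ void myKernel(int *flag)
{
    // ... real work ...
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *flag = 1;
}

int main()
{
    // Must be set before any other CUDA call on pre-UVA systems.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *h_flag, *d_flag;
    cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped); // page-locked + mapped
    cudaHostGetDevicePointer(&d_flag, h_flag, 0);             // device view of the same memory

    *h_flag = 0;
    myKernel<<<1, 1>>>(d_flag);
    cudaDeviceSynchronize();        // kernel launches are asynchronous; modern name for cudaThreadSynchronize()
    printf("flag = %d\n", *h_flag); // the value already sits in host RAM

    cudaFreeHost(h_flag);
    return 0;
}
```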
If you use the first method, you will have to call cudaThreadSynchronize() to make sure the kernel has finished executing, since kernel launches are asynchronous. You can use cudaMemcpyAsync, which is also asynchronous, but the GPU cannot run a kernel and execute cudaMemcpyAsync in parallel unless you use streams.
I never actually tried it, but if your program won't crash when the loop executes too many times, you might try skipping the synchronisation and letting the loop iterate until the special value appears in RAM. In that solution, the memory transfer could be completely hidden and you would pay the overhead only at the end. You would, however, need to somehow prevent the loop from iterating too many times; CUDA events may be helpful.
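For comparison, a sketch of the cudaMemcpyAsync route in a loop with a stream, assuming a hypothetical kernel step and a made-up stop value (error checking omitted). The host buffer is pinned, because cudaMemcpyAsync is only truly asynchronous from page-locked memory:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: does its work, then publishes a result value.
__global__ void step(int *d_result)
{
    // ... real work ...
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *d_result = 42;  // hypothetical completion value
}

int main()
{
    int *d_result, *h_result;
    cudaMalloc(&d_result, sizeof(int));
    cudaMallocHost(&h_result, sizeof(int));  // pinned host buffer

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < 1000; ++i) {
        step<<<1, 1, 0, stream>>>(d_result);
        // Queued behind the kernel in the same stream; does not block the CPU.
        cudaMemcpyAsync(h_result, d_result, sizeof(int),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);  // block only when the value is needed
        if (*h_result == 42)
            break;
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(h_result);
    cudaFree(d_result);
    return 0;
}
```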
Why not use pinned memory? If your system supports it, see the CUDA C Programming Guide's section on pinned memory.
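A minimal sketch of the plain pinned-memory route, with an illustrative helper name (error checking omitted). Page-locked buffers skip the driver's internal staging copy, so even a 4-byte cudaMemcpy is cheaper:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: read one int back from the device through a
// page-locked staging buffer that is allocated once and reused.
int readFlag(const int *d_flag)
{
    static int *h_flag = nullptr;
    if (h_flag == nullptr)
        cudaMallocHost(&h_flag, sizeof(int));  // pinned allocation
    cudaMemcpy(h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    return *h_flag;
}
```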
Copying data to and from the GPU is much slower than accessing that data on the CPU. If you are not running a significant number of threads to justify the transfer, this will result in very slow performance; don't do it.
What you are describing sounds like a serial algorithm; it needs to be parallelised to make it worth doing with CUDA. If you can't rewrite your algorithm as a single write of multiple data to the GPU, many threads of work, and a single write of multiple data back to the CPU, then your algorithm should run on the CPU.
If you need the value computed in the previous kernel call to launch the next one, then the loop is serialized and your choice is cudaMemcpy(dst, src, sizeof(int), cudaMemcpyDeviceToHost).
If the kernel launch parameters do not depend on the previous launch, you can instead store the result of each kernel invocation in GPU memory and then download all the results at once, as sketched below.
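A minimal sketch of that batching pattern, with a hypothetical kernel step (error checking omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each iteration writes its result into its own slot,
// so no per-iteration copy back to the host is needed.
__global__ void step(int *d_results, int i)
{
    // ... real work ...
    if (threadIdx.x == 0 && blockIdx.x == 0)
        d_results[i] = i;
}

int main()
{
    const int N = 100;
    int *d_results, h_results[N];
    cudaMalloc(&d_results, N * sizeof(int));

    for (int i = 0; i < N; ++i)
        step<<<1, 1>>>(d_results, i);   // no copy inside the loop

    // One transfer at the end instead of N small ones; cudaMemcpy
    // implicitly waits for the preceding kernels to finish.
    cudaMemcpy(h_results, d_results, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("last result = %d\n", h_results[N - 1]);

    cudaFree(d_results);
    return 0;
}
```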