Would using cudaHostAlloc be beneficial in my case?
I launch a kernel repeatedly until a solution is found; at least one block is guaranteed to find one.
Therefore, when a block finds the solution it should inform the CPU, so that the CPU can print the solution provided by that block.
What I am currently doing is the following:
__global__ void kernel(int *sol)
{
    // do some computations
    if (the block found a solution)      // pseudocode condition
        atomicExch(sol, blockIdx.x);     // record the winning block's ID atomically
}
Now, on every call to the kernel I copy sol back to host memory and check its value. If it is set to 3, for example, I know that block 3 found the solution, so I know where the solution's index starts and can copy the solution back to the host.
In this case, would using cudaHostAlloc be a better option? Moreover, does copying a single integer back on every kernel call slow my program down?
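For concreteness, here is a minimal host-side sketch of the polling approach described in the question. The kernel body, the launch configuration (8 blocks of 128 threads), and the -1 "no solution yet" sentinel are illustrative assumptions, not taken from the original code:

#include <cstdio>

// Placeholder kernel: the real search is elsewhere; here block 3 "finds" the solution.
__global__ void kernel(int *sol)
{
    // ... do some computations ...
    bool found = (blockIdx.x == 3);      // stands in for "the block found a solution"
    if (found)
        atomicExch(sol, blockIdx.x);     // atomically record the winning block's ID
}

int main()
{
    int *d_sol;
    cudaMalloc((void **)&d_sol, sizeof(int));

    int h_sol = -1;                      // -1 = "no solution yet" (assumed sentinel)
    while (h_sol == -1) {
        cudaMemcpy(d_sol, &h_sol, sizeof(int), cudaMemcpyHostToDevice);  // reset the flag
        kernel<<<8, 128>>>(d_sol);
        // This device-to-host copy runs on the default stream, so it implicitly
        // waits for the kernel to finish before reading d_sol.
        cudaMemcpy(&h_sol, d_sol, sizeof(int), cudaMemcpyDeviceToHost);
    }
    printf("block %d found the solution\n", h_sol);
    cudaFree(d_sol);
    return 0;
}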
Issuing a copy from the GPU to the CPU and then waiting for its completion will slow your program down a bit. Note that whether you choose to send 1 byte or 1 KB won't make much of a difference: in this case the problem is not bandwidth but latency.
But launching a kernel consumes some time as well. If the "meat" of your algorithm is in the kernel itself, I wouldn't spend too much time worrying about that single, small transfer.
Do note that if you choose to use mapped memory instead of cudaMemcpy, you will need to place an explicit cudaDeviceSynchronize() (or cudaThreadSynchronize() with older CUDA versions) barrier before reading the status, as opposed to relying on the implicit barrier of cudaMemcpy. Otherwise, your host code may go ahead and read an old value stored in your pinned memory before the kernel overwrites it.
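For comparison, here is a minimal sketch of the mapped pinned-memory variant described above, with the explicit cudaDeviceSynchronize() before the host reads the flag. The placeholder kernel, launch configuration, and -1 sentinel are the same illustrative assumptions as in the sketch above:

#include <cstdio>

// Placeholder kernel: block 3 "finds" the solution and records its ID.
__global__ void kernel(int *sol)
{
    // ... do some computations ...
    if (blockIdx.x == 3)
        atomicExch(sol, blockIdx.x);     // write lands directly in host-visible pinned memory
}

int main()
{
    // May be required on older drivers/devices before mapping host memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *h_sol;                          // host pointer to the mapped pinned allocation
    int *d_sol;                          // device alias of the same allocation
    cudaHostAlloc((void **)&h_sol, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_sol, h_sol, 0);

    *h_sol = -1;                         // -1 = "no solution yet" (assumed sentinel)
    while (*h_sol == -1) {
        kernel<<<8, 128>>>(d_sol);
        // No cudaMemcpy here, so there is no implicit barrier: synchronize
        // explicitly, or the host may read a stale value from pinned memory.
        cudaDeviceSynchronize();
    }
    printf("block %d found the solution\n", *h_sol);
    cudaFreeHost(h_sol);
    return 0;
}

The kernel writes through the device alias d_sol, but the store lands in the same pinned host allocation, so no copy is needed; only the explicit synchronization guarantees the host does not read the flag before the kernel has finished.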