CUDA streams not overlapping
I have something very similar to the following code:
float *g_in, *g_out;                  // device buffers (declarations were missing in the original snippet)
int k;
const int no_streams = 4;             // const so the array dimension below is a compile-time constant
cudaStream_t stream[no_streams];
for (k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]);
cudaMalloc(&g_in,  size1*no_streams); // one size1-byte slice per stream
cudaMalloc(&g_out, size2*no_streams); // one size2-byte slice per stream
// Breadth-first issue order: all H2D copies, then all kernels, then all D2H copies.
for (k = 0; k < no_streams; k++)
    cudaMemcpyAsync(g_in + k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);
for (k = 0; k < no_streams; k++)
    mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in + k*size1/sizeof(float), g_out + k*size2/sizeof(float));
for (k = 0; k < no_streams; k++)
    cudaMemcpyAsync(h_ptr_out[k], g_out + k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);
cudaThreadSynchronize();              // valid on CUDA 3.2; deprecated in favor of cudaDeviceSynchronize() since CUDA 4.0
for (k = 0; k < no_streams; k++) cudaStreamDestroy(stream[k]);
cudaFree(g_in);
cudaFree(g_out);
'h_ptr_in' and 'h_ptr_out' are arrays of pointers allocated with cudaMallocHost (with no flags).
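For reference, a minimal sketch of how such pinned buffers would be allocated (the actual allocation code isn't shown in the question; sizes and counts are assumed to match the snippet above):

float *h_ptr_in[no_streams], *h_ptr_out[no_streams];
for (k = 0; k < no_streams; k++) {
    cudaMallocHost((void**)&h_ptr_in[k],  size1);  // page-locked host memory, default flags
    cudaMallocHost((void**)&h_ptr_out[k], size2);  // pinned memory is required for truly asynchronous cudaMemcpyAsync
}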
The problem is that the streams do not overlap.
In the visual profiler I can see the kernel execution from the first stream overlapping with the copy (H2D) from the second stream but nothing else overlaps.
I may not have the resources to run two kernels concurrently (I think I do), but at the very least kernel execution and copies should be overlapping, right?
And if I put all three operations (H2D copy, kernel execution, D2H copy) within the same for-loop, none of them overlap...
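For clarity, the interleaved variant described in the previous sentence would look roughly like this (a sketch reconstructed from the description, not the exact code):

for (k = 0; k < no_streams; k++) {
    cudaMemcpyAsync(g_in + k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);
    mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in + k*size1/sizeof(float), g_out + k*size2/sizeof(float));
    cudaMemcpyAsync(h_ptr_out[k], g_out + k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);
}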
Please help: what could be causing this?
I'm running on:
Ubuntu 10.04 x64
Device: "GeForce GTX 460"
(CUDA Driver Version: 3.20,
CUDA Runtime Version: 3.20,
CUDA Capability Major/Minor version number: 2.1,
Concurrent copy and execution: Yes,
Concurrent kernel execution: Yes)
2 Answers
According to this post on the NVIDIA forums, the profiler serializes stream execution in order to collect accurate timing data. If you think your timings are off, make sure you're measuring with CUDA events...
I've been experimenting with streams lately, and I found the "simpleMultiCopy" example from the SDK really helpful, particularly for getting the issue order and synchronization logic right.
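A minimal sketch of timing with CUDA events, assuming the setup from the question (cudaEventElapsedTime reports milliseconds):

cudaEvent_t start, stop;
float elapsed_ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);                       // record in the default stream
// ... issue the async copies and kernel launches here ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                      // block until 'stop' has actually been reached
cudaEventElapsedTime(&elapsed_ms, start, stop);  // GPU-side elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);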
If you want to see kernels overlapping with other kernels (concurrent kernel execution), you need to use the CUDA Visual Profiler 5.0 that ships with the CUDA 5.0 Toolkit; I don't think earlier profilers are capable of showing this. It should also show kernel and memcpy overlap.