CUDA streams not overlapping
I have something very similar to the following code:
float *g_in, *g_out;                  // device buffers (declarations were missing in the original snippet)
int k;
const int no_streams = 4;             // const so the array dimension below is a compile-time constant
cudaStream_t stream[no_streams];
for (k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]);
cudaMalloc(&g_in,  size1*no_streams); // one size1-byte slice per stream
cudaMalloc(&g_out, size2*no_streams); // one size2-byte slice per stream
// Breadth-first issue order: all H2D copies, then all kernels, then all D2H copies.
for (k = 0; k < no_streams; k++)
    cudaMemcpyAsync(g_in + k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);
for (k = 0; k < no_streams; k++)
    mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in + k*size1/sizeof(float), g_out + k*size2/sizeof(float));
for (k = 0; k < no_streams; k++)
    cudaMemcpyAsync(h_ptr_out[k], g_out + k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);
cudaThreadSynchronize();              // valid on CUDA 3.2; deprecated in favor of cudaDeviceSynchronize() since CUDA 4.0
for (k = 0; k < no_streams; k++) cudaStreamDestroy(stream[k]);
cudaFree(g_in);
cudaFree(g_out);
'h_ptr_in' and 'h_ptr_out' are arrays of pointers allocated with cudaMallocHost (with no flags).
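For reference, a minimal sketch of how such pinned buffers would be allocated (the actual allocation code isn't shown in the question; sizes and counts are assumed to match the snippet above):

float *h_ptr_in[no_streams], *h_ptr_out[no_streams];
for (k = 0; k < no_streams; k++) {
    cudaMallocHost((void**)&h_ptr_in[k],  size1);  // page-locked host memory, default flags
    cudaMallocHost((void**)&h_ptr_out[k], size2);  // pinned memory is required for truly asynchronous cudaMemcpyAsync
}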
The problem is that the streams do not overlap.
In the visual profiler I can see the kernel execution from the first stream overlapping with the copy (H2D) from the second stream but nothing else overlaps.
I may not have the resources to run two kernels concurrently (I think I do), but at the very least kernel execution and copies should be overlapping, right?
And if I put all three operations (H2D copy, kernel execution, D2H copy) within the same for-loop, none of them overlap...
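For clarity, the interleaved variant described in the previous sentence would look roughly like this (a sketch reconstructed from the description, not the exact code):

for (k = 0; k < no_streams; k++) {
    cudaMemcpyAsync(g_in + k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);
    mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in + k*size1/sizeof(float), g_out + k*size2/sizeof(float));
    cudaMemcpyAsync(h_ptr_out[k], g_out + k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);
}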
Please help: what could be causing this?
I'm running on:
Ubuntu 10.04 x64
Device: "GeForce GTX 460"
(CUDA Driver Version: 3.20,
CUDA Runtime Version: 3.20,
CUDA Capability Major/Minor version number: 2.1,
Concurrent copy and execution: Yes,
Concurrent kernel execution: Yes)
2 Answers
According to this post on the NVIDIA forums, the profiler serializes stream execution in order to collect accurate timing data. If you think your timings are off, make sure you're measuring with CUDA events...
I've been experimenting with streams lately, and I found the "simpleMultiCopy" example from the SDK really helpful, particularly for getting the issue order and synchronization logic right.
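A minimal sketch of timing with CUDA events, assuming the setup from the question (cudaEventElapsedTime reports milliseconds):

cudaEvent_t start, stop;
float elapsed_ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);                       // record in the default stream
// ... issue the async copies and kernel launches here ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                      // block until 'stop' has actually been reached
cudaEventElapsedTime(&elapsed_ms, start, stop);  // GPU-side elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);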
If you want to see kernels overlapping with other kernels (concurrent kernel execution), you need to use the CUDA Visual Profiler 5.0 that ships with the CUDA 5.0 Toolkit; I don't think earlier profilers are capable of showing this. It should also show kernel and memcpy overlap.