
Device synchronization during an asynchronous memcpy in CUDA

Posted 2024-11-29 09:55:01

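Suppose I want to perform an asynchronous host-to-device memcpy in CUDA and then immediately run a kernel. How can I test, from within the kernel, whether the asynchronous transfer has completed?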


自演自醉 2024-12-06 09:55:01

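Sequencing your asynchronous copy and kernel launch on the same CUDA stream ensures that the kernel executes only after the asynchronous transfer has completed. The following code example demonstrates this: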

#include <stdio.h>

__global__ void kernel(const int *ptr)
{
  printf("Hello, %d\n", *ptr);
}

int main()
{
  int *h_ptr = 0;

  // allocate pinned (page-locked) host memory with cudaMallocHost
  // pinned memory is needed for the copy to be truly asynchronous with respect to the host
  cudaMallocHost(&h_ptr, sizeof(int));

  // look for thirteen in the output
  *h_ptr = 13;

  // allocate device memory
  int *d_ptr = 0;
  cudaMalloc(&d_ptr, sizeof(int));

  // create a stream
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // sequence the asynchronous copy on our stream
  cudaMemcpyAsync(d_ptr, h_ptr, sizeof(int), cudaMemcpyHostToDevice, stream);

  // sequence the kernel on our stream after the copy
  // the kernel will execute after the copy has completed
  kernel<<<1,1,0,stream>>>(d_ptr);

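  // wait for the copy and the kernel to finish before releasing resources
  // this also flushes the device-side printf buffer
  cudaStreamSynchronize(stream);

  // clean up after ourselves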
  cudaStreamDestroy(stream);
  cudaFree(d_ptr);
  cudaFreeHost(h_ptr);
}

And the output:

$ nvcc -arch=sm_20 async.cu -run
Hello, 13
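
The same ordering guarantee can be obtained when the copy and the kernel are issued on different streams, by recording an event after the copy and making the kernel's stream wait on it. The following is a minimal sketch under that assumption, not part of the original answer; the names `add_one`, `copy_then_compute`, `copy_stream`, `compute_stream`, and `copy_done` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: reads the copied value and writes value + 1.
__global__ void add_one(const int *in, int *out)
{
  *out = *in + 1;
}

// Hypothetical helper: copies `value` to the device on one stream, runs the
// kernel on a second stream that waits for the copy, and returns the result.
int copy_then_compute(int value)
{
  int *h_in = 0, *h_out = 0, *d_in = 0, *d_out = 0;
  cudaMallocHost(&h_in, sizeof(int));
  cudaMallocHost(&h_out, sizeof(int));
  cudaMalloc(&d_in, sizeof(int));
  cudaMalloc(&d_out, sizeof(int));
  *h_in = value;
  *h_out = -1;

  cudaStream_t copy_stream, compute_stream;
  cudaEvent_t copy_done;
  cudaStreamCreate(&copy_stream);
  cudaStreamCreate(&compute_stream);
  cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);

  // enqueue the copy, then mark its completion with an event
  cudaMemcpyAsync(d_in, h_in, sizeof(int), cudaMemcpyHostToDevice, copy_stream);
  cudaEventRecord(copy_done, copy_stream);

  // work enqueued on compute_stream after this call waits for the event
  cudaStreamWaitEvent(compute_stream, copy_done, 0);
  add_one<<<1, 1, 0, compute_stream>>>(d_in, d_out);

  // copy the result back on the same stream as the kernel, then wait for it
  cudaMemcpyAsync(h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost, compute_stream);
  cudaStreamSynchronize(compute_stream);
  int result = *h_out;

  cudaEventDestroy(copy_done);
  cudaStreamDestroy(copy_stream);
  cudaStreamDestroy(compute_stream);
  cudaFree(d_in);
  cudaFree(d_out);
  cudaFreeHost(h_in);
  cudaFreeHost(h_out);
  return result;
}

int main()
{
  printf("Result: %d\n", copy_then_compute(13));
  return 0;
}
```

The dependency is still resolved by the runtime before the kernel starts, so the kernel itself never has to check anything.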

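I don't believe there is any supported way to test, from within a kernel, whether some asynchronous condition (such as the completion of an asynchronous transfer) has been met. CUDA thread blocks are assumed to execute completely independently of other threads of execution.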
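
Although a kernel cannot test for completion, the host can. As a sketch that is not part of the original answer, the code below records an event after the copy and polls it with `cudaEventQuery`, which returns `cudaErrorNotReady` while the preceding work is still pending. The helper name `copy_and_poll` and the event name `copied` are hypothetical, and error checking is kept to a minimum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: starts an asynchronous host-to-device copy, polls for
// its completion from the host, and returns the value read back from the
// device (or -1 if the event query reported an error).
int copy_and_poll(int value)
{
  int *h_src = 0, *h_dst = 0, *d_ptr = 0;
  cudaMallocHost(&h_src, sizeof(int));
  cudaMallocHost(&h_dst, sizeof(int));
  cudaMalloc(&d_ptr, sizeof(int));
  *h_src = value;
  *h_dst = -1;

  cudaStream_t stream;
  cudaEvent_t copied;
  cudaStreamCreate(&stream);
  cudaEventCreate(&copied);

  // enqueue the copy and an event that completes once the copy has finished
  cudaMemcpyAsync(d_ptr, h_src, sizeof(int), cudaMemcpyHostToDevice, stream);
  cudaEventRecord(copied, stream);

  // non-blocking test from the host: cudaErrorNotReady means "still pending"
  cudaError_t status;
  while ((status = cudaEventQuery(copied)) == cudaErrorNotReady) {
    // the host is free to do other useful work here
  }

  // once the event has completed, read the value back with a blocking copy
  if (status == cudaSuccess) {
    cudaMemcpy(h_dst, d_ptr, sizeof(int), cudaMemcpyDeviceToHost);
  }
  int result = *h_dst;

  cudaEventDestroy(copied);
  cudaStreamDestroy(stream);
  cudaFree(d_ptr);
  cudaFreeHost(h_src);
  cudaFreeHost(h_dst);
  return result;
}

int main()
{
  printf("Device holds %d\n", copy_and_poll(13));
  return 0;
}
```

`cudaStreamQuery(stream)` can be used the same way when it is enough to know that everything enqueued on the stream so far has finished.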

