Copying global memory with CUDA threads

Posted on 2024-10-12 04:43:34


I need to copy one array in global memory to another array in global memory by CUDA threads (not from the host).

My code is as follows:

__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int start, end;
  start = some_func(idx);
  end = another_func(idx);
  unsigned int i;
  for (i = start; i < end; i++) {
      g_data2[i] = g_data1[idx];
  }
}

It is very inefficient because for some idx, the [start, end] region is very large, which makes that thread issue too many copy commands. Is there any way to implement it efficiently?

Thank you,

Zheng


2 Answers

痴意少年 2024-10-19 04:43:34


The way you wrote it, I am guessing each thread is trying to write the whole 'start' to 'end' chunk, which is really inefficient.

You need to do something like this:

__shared__ unsigned sm_start[BLOCK_SIZE];
__shared__ unsigned sm_end[BLOCK_SIZE];
sm_start[threadIdx.x] = start;
sm_end[threadIdx.x] = end;
__syncthreads();

// Every thread in the block cooperates on each thread's [start, end) range,
// so consecutive threads write consecutive addresses (coalesced stores).
for (int n = 0; n < blockDim.x; n++) {
    int value = g_data1[blockIdx.x * blockDim.x + n];  // source element owned by thread n
    unsigned base = sm_start[n];
    unsigned lim  = sm_end[n] - sm_start[n];
    for (unsigned i = threadIdx.x; i < lim; i += blockDim.x) {
        g_data2[base + i] = value;
    }
}
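
For context, a minimal host-side launch sketch under stated assumptions: d_in, d_out, and num_threads are hypothetical names (device arrays already allocated with cudaMalloc and the number of source elements), BLOCK_SIZE must match the size of the __shared__ arrays above, and the output array is assumed large enough for every [start, end) range.

#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Hypothetical launcher: one thread per source element of d_in.
void launch_copy(int *d_in, int *d_out, int num_threads)
{
    int num_blocks = (num_threads + BLOCK_SIZE - 1) / BLOCK_SIZE;
    copy_kernel<<<num_blocks, BLOCK_SIZE>>>(d_in, d_out, num_threads);
    cudaDeviceSynchronize();   // wait for the device-side copy to finish
}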
我的痛♀有谁懂 2024-10-19 04:43:34


Try using this:

CUresult cuMemcpyDtoD(
    CUdeviceptr dst,
    CUdeviceptr src,
    unsigned int bytes
)
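
For reference, a minimal sketch of the same device-to-device copy done from the host with the runtime API instead of the driver API. The names d_dst, d_src, and n are hypothetical (device pointers and an element count), and this only covers a plain contiguous copy, not the per-thread [start, end) expansion from the question.

#include <cuda_runtime.h>

// Copy n ints from one device array to another without launching a kernel.
cudaError_t copy_on_device(int *d_dst, const int *d_src, size_t n)
{
    return cudaMemcpy(d_dst, d_src, n * sizeof(int),
                      cudaMemcpyDeviceToDevice);
}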

UPDATE:

You're right: http://forums.nvidia.com/index.php?showtopic=88745

There is no efficient way to do this properly, because the design of CUDA expects you to use only a small amount of data in the kernel.
