Copying global memory with CUDA threads
I need to copy one array in global memory to another array in global memory using CUDA threads (not from the host).
My code is as follows:
__global__ void copy_kernel(int *g_data1, int *g_data2, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // some_func and another_func are placeholders that give this thread's range.
    int start = some_func(idx);
    int end = another_func(idx);

    // Each thread writes its whole [start, end) range on its own.
    for (int i = start; i < end; i++) {
        g_data2[i] = g_data1[idx];
    }
}
This is very inefficient: for some values of idx the [start, end) range is very large, so that single thread ends up issuing far too many copy operations. Is there a way to implement this efficiently?
Thank you,
Zheng
Comments (2)
The way you wrote it, I am guessing each thread tries to write its whole 'start' to 'end' chunk by itself, which is really inefficient.
You need to do something like this.
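A minimal sketch of one way to share each range among many threads. This is not the answerer's original code. It assumes the bounds from some_func and another_func can be precomputed into two device arrays, here called starts and ends (hypothetical names), and it gives each source element its own block instead of its own thread. The names fill_segments and fill_segments_host are also made up for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumption: starts[s] and ends[s] hold the values that some_func(s) and
// another_func(s) would return, so source element s covers [starts[s], ends[s]).
__global__ void fill_segments(const int *src, int *dst,
                              const int *starts, const int *ends,
                              int num_segments)
{
    int seg = blockIdx.x;                 // one block per source element
    if (seg >= num_segments) return;

    int start = starts[seg];
    int end   = ends[seg];
    int value = src[seg];                 // read once per thread, then reused

    // The threads of the block stride across the range together, so
    // neighbouring threads write neighbouring addresses.
    for (int i = start + threadIdx.x; i < end; i += blockDim.x) {
        dst[i] = value;
    }
}

// Hypothetical host helper: copies the inputs to the device, launches the
// kernel, and returns the output array (unwritten slots stay at -1).
std::vector<int> fill_segments_host(const std::vector<int> &src,
                                    const std::vector<int> &starts,
                                    const std::vector<int> &ends,
                                    int out_size)
{
    std::vector<int> out(out_size, -1);
    int num_segments = static_cast<int>(src.size());
    if (num_segments == 0 || out_size == 0) return out;

    size_t seg_bytes = num_segments * sizeof(int);
    size_t out_bytes = out_size * sizeof(int);

    int *d_src, *d_dst, *d_starts, *d_ends;
    cudaMalloc(&d_src, seg_bytes);
    cudaMalloc(&d_starts, seg_bytes);
    cudaMalloc(&d_ends, seg_bytes);
    cudaMalloc(&d_dst, out_bytes);

    cudaMemcpy(d_src, src.data(), seg_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_starts, starts.data(), seg_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_ends, ends.data(), seg_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_dst, out.data(), out_bytes, cudaMemcpyHostToDevice);

    // 256 threads per block is an arbitrary choice for this sketch.
    fill_segments<<<num_segments, 256>>>(d_src, d_dst, d_starts, d_ends,
                                         num_segments);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(out.data(), d_dst, out_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_src);
    cudaFree(d_starts);
    cudaFree(d_ends);
    cudaFree(d_dst);
    return out;
}
```

With this mapping a very long range is written by 256 threads in parallel rather than by one, and the writes within a warp land on consecutive addresses. The trade-off is that a whole block is spent even on very short ranges; if most ranges are short, using one warp per range, or one thread per output element, may suit the data better.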
Try using this:
UPDATE:
You're right: http://forums.nvidia.com/index.php?showtopic=88745
There is no efficient way to do this properly, because CUDA is designed for you to work with only a small amount of data in the kernel.