CUDA: replacing a double for loop with a 2D block
I'm really new to CUDA and have been trying to traverse a 2D array. I have the following code, which works as expected in plain C:
for (ty = 0; ty < s; ty++) {
    if (ty + pixY < s && ty + pixY >= 0) {
        for (tx = 0; tx < r; tx++) {
            T[ty/3][tx/3] += (tx + pixX < s && tx + pixX >= 0) ?
                *(image + M*(ty + pixY) + tx + pixX) * *(filter + fw*(ty % 3) + tx % 3) : 0;
        }
    }
}
Maybe I'm getting something wrong, but wouldn't this code translate to CUDA as follows?
tx = threadIdx.x;
ty = threadIdx.y;
T[ty/3][tx/3] += (tx + pixX < s && tx + pixX >= 0) ?
    *(image + M*(ty + pixY) + tx + pixX) * *(filter + fw*(ty % 3) + tx % 3) : 0;
provided I have defined my kernel launch parameters as dimGrid(1,1,1) and blockDim(r,s,1).
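Concretely, the launch itself looks roughly like this (the kernel name and arguments here are just placeholders for illustration):

dim3 dimGrid(1, 1, 1);
dim3 blockDim(r, s, 1);
// a single block of r x s threads, one thread per (tx, ty) pair
convKernel<<<dimGrid, blockDim>>>(/* device pointers for image, filter, T, etc. */);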
I ask because I'm getting unexpected results. Also, if I properly declare and allocate my arrays as 2D CUDA arrays instead of one big 1D array, would that help?
Thanks for your help.
Leaving aside whether the array allocation and indexing schemes are correct (I am not sure there is enough information in the post to confirm that), and the fact that integer division and modulo are slow and should be avoided, you have a much more fundamental problem: a memory race.
Multiple threads within the single block you are using will be attempting to read and write the same entry of T at the same time. With the indexing above, every thread whose tx is in {0, 1, 2} and whose ty is in {0, 1, 2} updates T[0][0], for example. CUDA makes no guarantees about the correctness of this sort of operation, and it is almost certainly not going to work. The simplest alternative is to use a single thread to compute each T[][] entry, rather than the nine threads (a 3 x 3 group) that currently map to it. This eliminates the memory race.
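Here is a minimal sketch of that approach, assuming float data, that r and s are multiples of 3, that image and filter are laid out exactly as in your code, and that T has been flattened into a 1D device array with rows of width r/3 (the kernel name is just illustrative):

__global__ void conv3x3(float *T, const float *image, const float *filter,
                        int pixX, int pixY, int s, int r, int M, int fw)
{
    // One thread per output entry: this thread alone owns its T element,
    // accumulates the whole 3x3 window privately, and no two threads
    // ever write the same location, so there is no memory race.
    int ox = blockIdx.x * blockDim.x + threadIdx.x;   // plays the role of tx/3
    int oy = blockIdx.y * blockDim.y + threadIdx.y;   // plays the role of ty/3
    if (ox >= r / 3 || oy >= s / 3) return;

    float acc = 0.0f;
    for (int j = 0; j < 3; j++) {                     // j corresponds to ty % 3
        int ty = 3 * oy + j;
        if (ty + pixY < s && ty + pixY >= 0) {
            for (int i = 0; i < 3; i++) {             // i corresponds to tx % 3
                int tx = 3 * ox + i;
                if (tx + pixX < s && tx + pixX >= 0)
                    acc += image[M * (ty + pixY) + tx + pixX]
                         * filter[fw * j + i];
            }
        }
    }
    T[oy * (r / 3) + ox] += acc;
}

Launched over a 2D grid, e.g.

dim3 block(16, 16);
dim3 grid((r / 3 + block.x - 1) / block.x, (s / 3 + block.y - 1) / block.y);
conv3x3<<<grid, block>>>(T, image, filter, pixX, pixY, s, r, M, fw);

this also removes the dimGrid(1,1,1) restriction, so you are no longer limited by the maximum number of threads in a single block.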