CUDA: replacing a double for loop with a 2D thread block

Posted on 2024-11-08 13:45:24


I'm really new to CUDA and have been trying to traverse a 2D array. I have the following code which works as expected on plain C:

for (ty = 0; ty < s; ty++) {
    if (ty + pixY < s && ty + pixY >= 0) {
        for (tx = 0; tx < r; tx++) {
            T[ty/3][tx/3] += (tx + pixX < s && tx + pixX >= 0) ?
                *(image + M*(ty + pixY) + tx + pixX) * *(filter + fw*(ty % 3) + tx % 3) : 0;
        }
    }
}

Maybe I'm getting something wrong, but wouldn't this code translate to CUDA as follows?

tx = threadIdx.x;
ty = threadIdx.y;

T[ty/3][tx/3] += (tx + pixX < s && tx + pixX >= 0) ?
    *(image + M*(ty + pixY) + tx + pixX) * *(filter + fw*(ty % 3) + tx % 3) : 0;

provided I have defined my kernel launch parameters as dimGrid(1,1,1) and blockDim(r,s,1).

I ask because I'm getting unexpected results. Also, if I properly declare and allocate my arrays as 2D CUDA arrays instead of just one big 1D array, will this help?

Thanks for your help.


Comments (1)

夏日浅笑〃 2024-11-15 13:45:25


Leaving aside whether the array allocation and indexing schemes are correct (I am not sure there is enough information in the post to confirm that), and the fact that integer division and modulo are slow and should be avoided, you have a much more fundamental problem - a memory race.

Multiple threads within the single block you are using will be attempting to read and write to the same entry of T at the same time. CUDA makes no guarantees about the correctness of this sort of operation and it is almost certainly not going to work. The simplest alternative is to only use a single thread to compute each T[][] entry, rather than three threads. This eliminates the memory race.
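As a minimal sketch of that suggestion, the kernel below assigns exactly one thread to each T[][] entry, and each thread privately accumulates its own 3x3 window before writing. The kernel name, the parameter list, and the assumption that T is stored as a flat row-major array of width r/3 are mine for illustration, not from the original post:

```cuda
// Sketch only: one thread computes one T entry, so no two threads ever
// write the same output element and the memory race disappears.
// Assumes image, filter, and T are flat row-major device arrays.
__global__ void convolve3x3(float *T, const float *image, const float *filter,
                            int r, int s, int M, int fw, int pixX, int pixY)
{
    int ox = blockIdx.x * blockDim.x + threadIdx.x;  // column index into T
    int oy = blockIdx.y * blockDim.y + threadIdx.y;  // row index into T
    if (ox >= r / 3 || oy >= s / 3) return;

    float acc = 0.0f;
    // Privately accumulate the 3x3 window that maps to T[oy][ox];
    // ty % 3 == dy and tx % 3 == dx, matching the original indexing.
    for (int dy = 0; dy < 3; dy++) {
        int ty = oy * 3 + dy;
        if (ty >= s || ty + pixY < 0 || ty + pixY >= s) continue;
        for (int dx = 0; dx < 3; dx++) {
            int tx = ox * 3 + dx;
            if (tx >= r || tx + pixX < 0 || tx + pixX >= s) continue;
            acc += image[M * (ty + pixY) + tx + pixX] * filter[fw * dy + dx];
        }
    }
    T[oy * (r / 3) + ox] += acc;  // single writer per entry: no race
}
```

It would be launched with a 2D grid sized to the output rather than the input, e.g. `dim3 block(16, 16); dim3 grid((r/3 + 15) / 16, (s/3 + 15) / 16);`, which also removes the single-block limit of `dimGrid(1,1,1)`.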
