CUDA filter where the output of one block is the input of the next block

Posted on 2024-12-08 14:33:46


Working on the following filter, I am having trouble translating this piece of code to process an image on the GPU:

for (int h = 0; h < height; h++) {
    for (int w = 1; w < width; w++) {
        image[h][w] = (1 - a) * image[h][w] + a * image[h][w-1];
    }
}

If I define:

dim3 threads_perblock(32, 32)

then within each block the threads can communicate with one another, but the threads of one block cannot communicate with the threads of other blocks.

Within a thread block I can translate that piece of code using shared memory. However, at the edges (so to speak), image[0][31] and image[0][32] end up in different thread blocks: image[0][32] needs the value of image[0][31] to compute its own value, but they live in different thread blocks.

So that is the problem.

How would I solve this?

Thanks in advance.


Comments (2)

奶气 2024-12-15 14:33:46


If image is in global memory then there is no problem: you don't need shared memory at all, and you can read the pixels directly from image.
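As a concrete illustration of this first case, here is a minimal sketch, assuming one thread per row (the kernel name row_filter and the launch configuration are my own, not part of the answer). The loop over w stays sequential because the recurrence reads the already-updated image[h][w-1]:

__global__ void row_filter(float *image, int width, int height, float a)
{
    // one thread per row; rows are independent, so they run in parallel
    int h = blockIdx.x * blockDim.x + threadIdx.x;
    if (h >= height) return;

    // the loop over w stays sequential because image[h][w] depends on
    // the already-updated image[h][w-1]
    for (int w = 1; w < width; ++w)
        image[h * width + w] = (1.0f - a) * image[h * width + w]
                             + a * image[h * width + w - 1];
}

// launch example: 256 threads per block, enough blocks to cover all rows
// row_filter<<<(height + 255) / 256, 256>>>(d_image, width, height, a);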

However if you have already done some processing prior to this, and a block of image is already in shared memory, then you have a problem, since you need to do neighbourhood operations which are outside the range of your block. You can do one of the following - either:

  • write shared memory back to global memory so that it is accessible to neighbouring blocks (disadvantage: performance, synchronization between blocks can be tricky)

or:

  • process additional edge pixels per block with an overlap (1 pixel in this case) so that you have additional pixels in each block to handle the edge cases, e.g. work with a 34x34 block size but store only the 32x32 central output pixels (disadvantage: requires additional logic within kernel, branches may result in warp divergence, not all threads in block are fully used)

Unfortunately neighbourhood operations can be really tricky in CUDA and there is always a down-side whatever method you use to handle edge cases.
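A minimal sketch of the second option, the overlapped (halo) tile, assuming a 32x32 block that also loads one extra column on its left; the names TILE and halo_filter are illustrative. Note that it applies the filter as a plain neighbourhood operation on the original left neighbour; the recursive in-place form in the question cannot use a pre-loaded halo directly, because it needs the already-updated value from the neighbouring block:

#define TILE 32

__global__ void halo_filter(const float *in, float *out,
                            int width, int height, float a)
{
    // each 32x32 tile also holds one halo column on its left, so every
    // thread can read its left neighbour without leaving the block
    __shared__ float tile[TILE][TILE + 1];

    int w = blockIdx.x * TILE + threadIdx.x;   // global column
    int h = blockIdx.y * TILE + threadIdx.y;   // global row
    bool inside = (w < width) && (h < height);

    if (inside) {
        tile[threadIdx.y][threadIdx.x + 1] = in[h * width + w];
        if (threadIdx.x == 0)                  // leftmost thread fetches the halo pixel
            tile[threadIdx.y][0] = (w > 0) ? in[h * width + w - 1]
                                           : in[h * width + w];   // clamp at the image border
    }
    __syncthreads();

    if (inside)
        out[h * width + w] = (w > 0)
            ? (1.0f - a) * tile[threadIdx.y][threadIdx.x + 1]
                  + a * tile[threadIdx.y][threadIdx.x]
            : in[h * width + w];               // column 0 has no left neighbour, copy it through
}

// launch example:
// dim3 block(TILE, TILE);
// dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
// halo_filter<<<grid, block>>>(d_in, d_out, width, height, a);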

夜清冷一曲。 2024-12-15 14:33:46


You can just use a busy spin (no joke). Just make the thread processing a[32] execute:

while(!variable);

before starting to compute and the thread processing a[31] do

variable = 1;

when it finishes. It's up to you to generalize this. I know this is considered "rogue programming" in CUDA, but it seems the only way to achieve what you want. I had a very similar problem and it worked for me. Your performance might suffer though...
Be careful, however, that

dim3 threads_perblock(32, 32) 

means you have 32 x 32 = 1024 threads per block.
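For what it's worth, here is a minimal sketch of that spin-flag handshake between two blocks; the kernel name and the single-value exchange are illustrative, not from the answer, and as stressed above it only behaves if both blocks are resident on the GPU at the same time, otherwise the spinning block hangs the kernel:

#include <cstdio>

__global__ void handshake(volatile float *boundary, volatile int *flag)
{
    // block 0 plays the role of the block that owns image[0][31]:
    // publish the value, fence, then raise the flag
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *boundary = 42.0f;      // stand-in for the updated image[0][31]
        __threadfence();        // make the write visible to other blocks
        *flag = 1;
    }
    // block 1 plays the role of the block that owns image[0][32]:
    // busy-spin on the flag before touching the boundary value
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (*flag == 0) { }  // busy spin
        printf("block 1 sees boundary value %f\n", *boundary);
    }
}

// launch example, with *d_flag zero-initialised beforehand (e.g. via cudaMemset):
// handshake<<<2, 32>>>(d_boundary, d_flag);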
