CUDA filter where the output of one block is the input of the next block

Posted on 2024-12-08 14:33:46


Working on the following filter, I am having trouble translating this piece of code to process an image on the GPU:

for (int h = 0; h < height; h++) {
    for (int w = 1; w < width; w++) {
        image[h][w] = (1 - a) * image[h][w] + a * image[h][w-1];
    }
}

If I define:

dim3 threads_perblock(32, 32)

then within each block the threads can communicate with one another, but the threads of one block cannot communicate with the threads of other blocks.

Within a thread block I can translate that piece of code using shared memory. However, at the edges (so to speak), image[0][31] and image[0][32] end up in different thread blocks: image[0][32] needs the value of image[0][31] to compute its own value, but they live in different thread blocks.

So that is the problem.

How would I solve this?

Thanks in advance.


Comments (2)

奶气 2024-12-15 14:33:46


If image is in global memory then there is no problem: you don't need shared memory at all, and you can read the pixels directly from image.
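As a concrete illustration of this first case, here is a minimal sketch, assuming one thread per row (the kernel name row_filter and the launch configuration are my own, not part of the answer). The loop over w stays sequential because the recurrence reads the already-updated image[h][w-1]:

__global__ void row_filter(float *image, int width, int height, float a)
{
    // one thread per row; rows are independent, so they run in parallel
    int h = blockIdx.x * blockDim.x + threadIdx.x;
    if (h >= height) return;

    // the loop over w stays sequential because image[h][w] depends on
    // the already-updated image[h][w-1]
    for (int w = 1; w < width; ++w)
        image[h * width + w] = (1.0f - a) * image[h * width + w]
                             + a * image[h * width + w - 1];
}

// launch example: 256 threads per block, enough blocks to cover all rows
// row_filter<<<(height + 255) / 256, 256>>>(d_image, width, height, a);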

However if you have already done some processing prior to this, and a block of image is already in shared memory, then you have a problem, since you need to do neighbourhood operations which are outside the range of your block. You can do one of the following - either:

  • write shared memory back to global memory so that it is accessible to neighbouring blocks (disadvantage: performance, synchronization between blocks can be tricky)

or:

  • process additional edge pixels per block with an overlap (1 pixel in this case) so that you have additional pixels in each block to handle the edge cases, e.g. work with a 34x34 block size but store only the 32x32 central output pixels (disadvantage: requires additional logic within kernel, branches may result in warp divergence, not all threads in block are fully used)

Unfortunately neighbourhood operations can be really tricky in CUDA and there is always a down-side whatever method you use to handle edge cases.
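A minimal sketch of the second option, the overlapped (halo) tile, assuming a 32x32 block that also loads one extra column on its left; the names TILE and halo_filter are illustrative. Note that it applies the filter as a plain neighbourhood operation on the original left neighbour; the recursive in-place form in the question cannot use a pre-loaded halo directly, because it needs the already-updated value from the neighbouring block:

#define TILE 32

__global__ void halo_filter(const float *in, float *out,
                            int width, int height, float a)
{
    // each 32x32 tile also holds one halo column on its left, so every
    // thread can read its left neighbour without leaving the block
    __shared__ float tile[TILE][TILE + 1];

    int w = blockIdx.x * TILE + threadIdx.x;   // global column
    int h = blockIdx.y * TILE + threadIdx.y;   // global row
    bool inside = (w < width) && (h < height);

    if (inside) {
        tile[threadIdx.y][threadIdx.x + 1] = in[h * width + w];
        if (threadIdx.x == 0)                  // leftmost thread fetches the halo pixel
            tile[threadIdx.y][0] = (w > 0) ? in[h * width + w - 1]
                                           : in[h * width + w];   // clamp at the image border
    }
    __syncthreads();

    if (inside)
        out[h * width + w] = (w > 0)
            ? (1.0f - a) * tile[threadIdx.y][threadIdx.x + 1]
                  + a * tile[threadIdx.y][threadIdx.x]
            : in[h * width + w];               // column 0 has no left neighbour, copy it through
}

// launch example:
// dim3 block(TILE, TILE);
// dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
// halo_filter<<<grid, block>>>(d_in, d_out, width, height, a);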

夜清冷一曲。 2024-12-15 14:33:46


You can just use a busy spin (no joke). Just make the thread processing a[32] execute:

while(!variable);

before starting to compute and the thread processing a[31] do

variable = 1;

when it finishes. It's up to you to generalize this. I know this is considered "rogue programming" in CUDA, but it seems the only way to achieve what you want. I had a very similar problem and it worked for me. Your performance might suffer though...
Be careful, however, that

dim3 threads_perblock(32, 32) 

means you have 32 x 32 = 1024 threads per block.
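For what it's worth, here is a minimal sketch of that spin-flag handshake between two blocks; the kernel name and the single-value exchange are illustrative, not from the answer, and as stressed above it only behaves if both blocks are resident on the GPU at the same time, otherwise the spinning block hangs the kernel:

#include <cstdio>

__global__ void handshake(volatile float *boundary, volatile int *flag)
{
    // block 0 plays the role of the block that owns image[0][31]:
    // publish the value, fence, then raise the flag
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *boundary = 42.0f;      // stand-in for the updated image[0][31]
        __threadfence();        // make the write visible to other blocks
        *flag = 1;
    }
    // block 1 plays the role of the block that owns image[0][32]:
    // busy-spin on the flag before touching the boundary value
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (*flag == 0) { }  // busy spin
        printf("block 1 sees boundary value %f\n", *boundary);
    }
}

// launch example, with *d_flag zero-initialised beforehand (e.g. via cudaMemset):
// handshake<<<2, 32>>>(d_boundary, d_flag);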
