A CUDA filter where the output of one block is the input of the next block
Working on the following filter, I am having trouble with this piece of code when processing an image on the GPU:
for (int h = 0; h < height; h++) {
    for (int w = 1; w < width; w++) {
        image[h][w] = (1 - a) * image[h][w] + a * image[h][w - 1];
    }
}
If I define:
dim3 threads_perblock(32, 32)
then in each block I have 32 threads that can communicate with each other. The threads of one block cannot communicate with the threads of other blocks.
Within a thread block, I can translate that piece of code using shared memory. However, at the tile edges, image[0][31] and image[0][32] land in different thread blocks: image[0][32] needs the value of image[0][31] to compute its own value, but the two pixels sit in different thread blocks.
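For example, with a launch like the following (filter_kernel is just a placeholder name), the image gets split into 32 x 32 tiles, and image[0][31] and image[0][32] fall into horizontally neighbouring tiles:

dim3 threads_perblock(32, 32);   // 32 x 32 = 1024 threads per block
dim3 num_blocks((width  + threads_perblock.x - 1) / threads_perblock.x,
                (height + threads_perblock.y - 1) / threads_perblock.y);
filter_kernel<<<num_blocks, threads_perblock>>>(d_image, width, height, a);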
So that is the problem.
How would I solve this?
Thanks in advance.
2 Answers
If image is in global memory then there is no problem - you don't need to use shared memory, and you can just access the pixels directly from image without any problem. However, if you have already done some processing prior to this, and a block of image is already in shared memory, then you have a problem, since you need to do neighbourhood operations which reach outside the range of your block. You can do one of the following - either: ... or: ...

Unfortunately, neighbourhood operations can be really tricky in CUDA, and there is always a down-side whatever method you use to handle the edge cases.
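As a concrete illustration of the global-memory case, here is a minimal sketch (the kernel name and launch parameters are my own, not from the answer). Since image[h][w] depends on the already-updated image[h][w-1], the recurrence is sequential along each row, so the natural mapping is one thread per row, each scanning left to right directly in global memory:

// Hypothetical sketch: one thread per row, operating directly on the
// image in global memory (assumed row-major, one float per pixel).
__global__ void iir_row_filter(float *image, int width, int height, float a)
{
    int h = blockIdx.x * blockDim.x + threadIdx.x;
    if (h >= height) return;

    float *row = image + h * width;
    for (int w = 1; w < width; w++)
        row[w] = (1.0f - a) * row[w] + a * row[w - 1];
}

A launch such as iir_row_filter<<<(height + 255) / 256, 256>>>(d_image, width, height, a) then needs no inter-block communication at all, because every thread owns a complete row.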
You can just use a busy spin (no joke). Just make the thread processing a[32] spin on a flag before it starts to compute, and have the thread processing a[31] set that flag when it finishes. It's up to you to generalize this. I know this is considered "rogue programming" in CUDA, but it seems to be the only way to achieve what you want. I had a very similar problem and it worked for me. Your performance might suffer, though...

Be careful, however: threads_perblock(32, 32) means you have 32 x 32 = 1024 threads per block.
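A minimal sketch of that idea, under some loud assumptions: the image is a single row in global memory, each block owns one 32-pixel tile, and done points to zero-initialized per-tile flags in global memory. All names (filter_tile, done) are illustrative, not from the original answer.

// Hypothetical sketch of the busy-spin scheme for one image row.
// WARNING: inter-block spinning can deadlock if a block waits on a
// tile whose block has not been scheduled yet; this relies on blocks
// becoming resident roughly in launch order.
__global__ void filter_tile(float *row, volatile int *done, int width, float a)
{
    int tile = blockIdx.x;                  // one 32-wide tile per block

    if (threadIdx.x == 0) {
        // Busy spin: our leftmost pixel needs the final value of the
        // rightmost pixel of the tile to our left.
        if (tile > 0)
            while (done[tile - 1] == 0) { /* spin */ }

        // The recurrence is sequential, so thread 0 walks its tile
        // (a parallel scan inside the tile would be faster).
        int begin = tile * blockDim.x;
        int end   = min(begin + (int)blockDim.x, width);
        for (int w = max(begin, 1); w < end; w++)
            row[w] = (1.0f - a) * row[w] + a * row[w - 1];

        __threadfence();   // make the pixels visible to other blocks
        done[tile] = 1;    // raise the flag for the tile on our right
    }
}

Launched as, say, filter_tile<<<(width + 31) / 32, 32>>>(d_row, d_done, width, a), this serializes the tiles left to right, which is exactly why the answer warns that performance might suffer.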