Massively parallel algorithm for propagating pixels

Posted on 2024-09-07 02:46:15

I'm designing a CUDA app to process some video. The algorithm I'm using calls for filling in blank pixels in a way that's not unlike Conway's Game of Life: if the pixels around a blank pixel are all filled and all have similar values, that pixel gets filled in with the surrounding value. This iterates until the number of pixels left to fix equals the number left after the previous iteration (i.e., when nothing else can be done).

My quandary is this: the stages before and after this one in the processing pipeline are both implemented in CUDA on the GPU. It would be expensive to transfer the entire image back to RAM, process it on the CPU, then transfer it back to the GPU. Even if it's slower, I would like to implement the algorithm in CUDA.

However, the nature of this problem requires synchronization between all threads so that the global image is updated between iterations. I thought about launching the kernel once per iteration, but I cannot determine when the process is "done" unless I transfer data back to the CPU after each iteration, which would be very inefficient because of the memory transfer latency over the PCI-e interface.

Does anyone with some experience with parallel algorithms have any suggestions? Thanks in advance.

Comments (1)

若水般的淡然安静女子 2024-09-14 02:46:18

It sounds like you need an extra image buffer, so that you can keep the unmodified input image in one buffer and write the processed output image into the second buffer. That way each thread can process a single output pixel (or small block of output pixels) without worrying about synchronization etc.
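A minimal sketch of the two-buffer idea, under several assumptions that are not in the original posts: pixels are stored as `int`, a blank pixel is marked with the hypothetical sentinel `BLANK`, "similar" means the neighbours differ by at most the hypothetical `TOLERANCE`, and a filled pixel takes the average of its neighbours. The device-side `changed` counter is an addition beyond what the answer describes; it is one possible way to detect completion while copying back only a single `int` per pass instead of the whole image. The names `fillPass`, `propagate`, and `fillOnDevice` are illustrative.

```cuda
#include <climits>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLANK = -1;     // assumed sentinel for an unfilled pixel
constexpr int TOLERANCE = 4;  // assumed threshold for "similar" values

// One pass. Reads only `in` and writes only `out`, so no thread can observe
// a neighbour that was updated during the same pass.
__global__ void fillPass(const int* in, int* out, int w, int h, int* changed)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int idx = y * w + x;
    int v = in[idx];
    if (v == BLANK) {
        int lo = INT_MAX, hi = INT_MIN, sum = 0, n = 0;
        bool allFilled = true;
        for (int dy = -1; dy <= 1; ++dy) {
            for (int dx = -1; dx <= 1; ++dx) {
                if (dx == 0 && dy == 0) continue;
                int nx = x + dx, ny = y + dy;
                // Assumption: neighbours outside the image are ignored.
                if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                int nv = in[ny * w + nx];
                if (nv == BLANK) {
                    allFilled = false;
                } else {
                    lo = min(lo, nv);
                    hi = max(hi, nv);
                    sum += nv;
                    ++n;
                }
            }
        }
        if (allFilled && n > 0 && hi - lo <= TOLERANCE) {
            v = sum / n;            // fill with the average of the neighbours
            atomicAdd(changed, 1);  // record that this pass did some work
        }
    }
    out[idx] = v;  // every pixel is written, so `out` is a complete image
}

// Runs passes until one of them changes nothing. Only a 4-byte counter is
// copied to the host per pass; the image itself stays on the device.
// Returns whichever of the two buffers holds the final image.
int* propagate(int* dA, int* dB, int w, int h)
{
    int* dChanged = nullptr;
    cudaMalloc(&dChanged, sizeof(int));
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    int changed = 0;
    do {
        cudaMemset(dChanged, 0, sizeof(int));
        fillPass<<<grid, block>>>(dA, dB, w, h, dChanged);
        cudaMemcpy(&changed, dChanged, sizeof(int), cudaMemcpyDeviceToHost);
        std::swap(dA, dB);  // the output of this pass is the input of the next
    } while (changed > 0);
    cudaFree(dChanged);
    return dA;
}

// Host-side wrapper, included only so the sketch can be exercised on its own.
std::vector<int> fillOnDevice(const std::vector<int>& img, int w, int h)
{
    size_t bytes = img.size() * sizeof(int);
    int* dA = nullptr;
    int* dB = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMemcpy(dA, img.data(), bytes, cudaMemcpyHostToDevice);
    int* dResult = propagate(dA, dB, w, h);
    std::vector<int> out(img.size());
    cudaMemcpy(out.data(), dResult, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    return out;
}
```

In a real pipeline the image would already be on the device, so only `propagate` would be called. The small `cudaMemcpy` of the counter still synchronizes the host with the device after each pass; if that turns out to matter, one option is to check the counter only every few passes, at the cost of occasionally running a pass that changes nothing.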
