Is a GPU a good fit for case-based image filtering?
I am trying to figure out whether a certain problem is a good candidate for using CUDA to put it on the GPU.
I am essentially doing a box filter that changes based on some edge detection. So there are basically 8 cases that are tested for each pixel, and then the rest of the operations happen - typical mean calculations and such. Does the presence of these switch statements in my loop make this problem a bad candidate for the GPU?
I am not sure really how to avoid the switch statements, because this edge detection has to happen at every pixel. I suppose the edge-detection part could be split out from the processing algorithm and run over the entire image first, storing a buffer that records which filter to use for each pixel, but that seems like it would add a lot of pre-processing to the algorithm.
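A minimal sketch of that pre-pass idea, assuming the 8 cases come from a quantized gradient direction (the kernel name and the classification rule are placeholders, not the actual edge test used in the algorithm):

```
// Pre-pass: store, for every pixel, which of the 8 filters to apply later.
// The classification rule (quantized gradient angle) is only a placeholder
// for the real edge test; the point is that the per-pixel switch moves out
// of the filtering loop and into its own buffer.
__global__ void classify_edges(const float *img, unsigned char *filterIdx,
                               int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;

    // Central-difference gradient, quantized into 8 direction bins.
    float gx = img[y * w + (x + 1)] - img[y * w + (x - 1)];
    float gy = img[(y + 1) * w + x] - img[(y - 1) * w + x];
    float angle = atan2f(gy, gx) + 3.14159265f;       // shift to [0, 2*pi]
    int bin = (int)(angle * (8.0f / 6.2831853f));     // 0..8
    filterIdx[y * w + x] = (unsigned char)min(bin, 7);
}
// A second kernel would then read filterIdx[] and apply the matching filter.
```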
Edit: Just to give some context - this algorithm is already written, and OpenMP has been used to pretty good effect at speeding it up. However, the 8 cores on my development box pale in comparison to the 512 in the GPU.
5 Answers
Edge detection, mean calculations and cross-correlation can all be implemented as 2D convolutions. Convolutions can be implemented very effectively on the GPU (speed-ups of more than 10x, up to 100x, relative to a CPU), especially for large kernels. So yes, it may make sense to rewrite the image filtering on the GPU.
Though I wouldn't use the GPU as a development platform for such a method.
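For reference, a naive (unoptimized) CUDA sketch of a 2D convolution, just to show the structure; a serious implementation would tile the image into shared memory and keep the kernel taps in constant memory:

```
// Naive 2D convolution: one thread per output pixel. Illustrative only.
__global__ void convolve2d(const float *img, const float *kern, float *out,
                           int w, int h, int kw, int kh)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float acc = 0.0f;
    for (int j = 0; j < kh; ++j) {
        for (int i = 0; i < kw; ++i) {
            // Clamp to the border so the kernel never reads out of bounds.
            int sx = min(max(x + i - kw / 2, 0), w - 1);
            int sy = min(max(y + j - kh / 2, 0), h - 1);
            acc += img[sy * w + sx] * kern[j * kw + i];
        }
    }
    out[y * w + x] = acc;
}
```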
Typically, unless you are on the new CUDA architecture, you will want to avoid branching. Because GPUs are basically SIMD machines, the pipeline is extremely vulnerable to, and suffers tremendously from, pipeline stalls due to branch misprediction.
If you think that there are significant benefits to be garnered by using a GPU, do some preliminary benchmarks to get a rough idea.
If you want to learn a bit about how to write non-branching code, head over to http://cellperformance.beyond3d.com/ and have a look.
Further, investigating running this problem on multiple CPU cores might also be worth it; in that case you will probably want to look into either OpenCL or the Intel performance libraries (such as TBB).
Another go-to source for problems targeting the GPU, be it graphics, computational geometry or otherwise, is IDAV, the Institute for Data Analysis and Visualization: http://idav.ucdavis.edu
Branching is actually not that bad, if there is spatial coherence in the branching. In other words, if you are expecting chunks of pixels next to each other in the image to go through the same branch, the performance hit is minimized.
Using a GPU for processing can often be counter-intuitive; things that are obviously inefficient if done in normal serial code, are actually the best way to do it in parallel using the GPU.
The pseudo-code below looks inefficient (since it computes 8 filtered values for every pixel) but will run efficiently on a GPU:
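(The sketch below is a CUDA-style stand-in for that idea rather than the original snippet; the per-case filter, a directional 3-tap mean, and the selection rule are placeholders.)

```
// Compute all 8 candidate filter outputs unconditionally, then keep one.
// Every thread executes the same instructions, so no warp ever diverges;
// the redundant arithmetic is cheap compared to divergent branching.

// Directional 3-tap mean: a stand-in for whatever each of the 8 cases averages.
__device__ float dir_mean(const float *img, int w, int x, int y, int dx, int dy)
{
    return (img[y * w + x]
          + img[(y + dy) * w + (x + dx)]
          + img[(y - dy) * w + (x - dx)]) / 3.0f;
}

__global__ void filter_all_cases(const float *img, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;

    const int dx[8] = { 1, 1, 0, -1, -1, -1,  0,  1 };
    const int dy[8] = { 0, 1, 1,  1,  0, -1, -1, -1 };

    // Placeholder selection rule: keep the candidate along the direction of
    // least intensity change. The ternaries compile to selects, not branches.
    float best = 0.0f, bestDiff = 1e30f;
    for (int c = 0; c < 8; ++c) {
        float cand = dir_mean(img, w, x, y, dx[c], dy[c]);
        float diff = fabsf(img[(y + dy[c]) * w + (x + dx[c])]
                         - img[(y - dy[c]) * w + (x - dx[c])]);
        bool better = diff < bestDiff;
        bestDiff = better ? diff : bestDiff;
        best     = better ? cand : best;
    }
    out[y * w + x] = best;
}
```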
Hopefully that helps!
Yep, control flow usually carries a performance penalty on the GPU, be it ifs, switches or ternary operators, because with control-flow operations the GPU can't run threads optimally. So the usual tactic is to avoid branching as much as possible. In some cases IFs can be replaced by a formula, where the IF conditions map to formula coefficients. But the concrete solution/optimization depends on the concrete GPU kernel... Maybe you can show the exact code, to be analyzed further by the stackoverflow community.
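As a generic illustration of that coefficient trick (not tied to the asker's filter), the same per-pixel choice can be written with and without a branch:

```
// Branchy version: threads in a warp that take different sides of the 'if'
// get serialized (divergence).
__device__ float pick_branchy(float edgeValue, float smoothValue, bool onEdge)
{
    if (onEdge)
        return edgeValue;
    return smoothValue;
}

// Formula version: the condition is folded into a 0/1 coefficient, so every
// thread runs the same instruction stream. A fractional weight would also
// give a soft blend between the two filters.
__device__ float pick_coefficient(float edgeValue, float smoothValue, bool onEdge)
{
    float t = onEdge ? 1.0f : 0.0f;   // condition mapped to a coefficient
    return t * edgeValue + (1.0f - t) * smoothValue;
}
```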
EDIT:
Just in case you are interested, here is a convolution pixel shader that I wrote.