How can I best improve the execution time of a bicubic interpolation algorithm?

Posted 2024-10-14 16:12:12

I'm developing image processing software in C++ on Intel that has to run a bicubic interpolation algorithm on small (about 1 kpx) images over and over again. This takes a lot of time, and I'm aiming to speed it up. What I have now is: a basic implementation based on the literature; a somewhat faster version that doesn't do matrix multiplication but instead uses pre-calculated formulas for parts of the interpolating polynomial; and a fixed-point version of the matrix-multiplying code (which actually runs slower). I also have an external library with an optimized implementation, but it's still too slow for my needs. What I was considering next is:

  • vectorization using MMX/SSE stream processing, on both the floating and fixed-point versions
  • doing the interpolation in the Fourier domain using convolution
  • shifting the work onto a GPU using OpenCL or similar

Which of these approaches could yield the greatest performance gains? Could you suggest another? Thanks.
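
For reference, here is a minimal sketch of what a formula-based version like the one described above might look like. It assumes a Catmull-Rom kernel, a row-major grayscale float image, and clamping at the borders; none of these are stated in the question, and the names `cubicWeights` and `sampleBicubic` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <vector>

// Catmull-Rom weights for the four taps around a sample at fractional
// offset t in [0, 1). Assumption: other bicubic kernels use other coefficients.
inline std::array<float, 4> cubicWeights(float t) {
    const float t2 = t * t;
    const float t3 = t2 * t;
    return {0.5f * (-t3 + 2.0f * t2 - t),
            0.5f * (3.0f * t3 - 5.0f * t2 + 2.0f),
            0.5f * (-3.0f * t3 + 4.0f * t2 + t),
            0.5f * (t3 - t2)};
}

// Samples a row-major grayscale image at (x, y).
// Tap coordinates outside the image are clamped to the nearest edge pixel.
float sampleBicubic(const std::vector<float>& img, int width, int height,
                    float x, float y) {
    const int ix = static_cast<int>(std::floor(x));
    const int iy = static_cast<int>(std::floor(y));
    // The weights depend only on the fractional offsets, so they are
    // computed once per sample rather than once per tap.
    const std::array<float, 4> wx = cubicWeights(x - ix);
    const std::array<float, 4> wy = cubicWeights(y - iy);

    float result = 0.0f;
    for (int j = 0; j < 4; ++j) {
        const int yy = std::min(std::max(iy - 1 + j, 0), height - 1);
        float row = 0.0f;
        for (int i = 0; i < 4; ++i) {
            const int xx = std::min(std::max(ix - 1 + i, 0), width - 1);
            row += wx[i] * img[yy * width + xx];
        }
        result += wy[j] * row;
    }
    return result;
}
```

If the same fractional offsets recur across images (for example, a fixed resampling grid), the two weight arrays could be cached per output pixel, leaving only the 16 multiply-adds per sample.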

Comments (4)

大姐,你呐 2024-10-21 16:12:13

There are the Intel IPP libraries, which use SIMD internally for faster processing. Intel IPP also uses OpenMP; if it is configured, you get the benefit of relatively easy multiprocessing.

These libraries do support bicubic interpolation and are payware (you buy a development license, but redistribution is free).

花桑 2024-10-21 16:12:13

Be careful with going the GPU route. If your convolution kernel is too fast, you're going to end up being IO bound. You won't know for sure which is the fastest unless you implement both.

GPU Gems 2 has a chapter on Fast Third-Order Texture Filtering which should be a good starting point for your GPU solution.

A combination of Intel Threading Building Blocks and SSE instructions would make a decent CPU solution.

蓝眸 2024-10-21 16:12:13

Not an answer for bicubic, but maybe an alternative:
if I understand you, you have a 32 x 32 grid of xy coordinates and a 1024 x 768 image, and want interpolated image[xy].
Just rounding xy, image[ int( xy )], would be too grainy.
But wait — you could make a smoothed double image 2k x 1.5k, once, and take
image2[ int( 2*xy )]: less grainy, very fast. Or similarly,
image4[ int( 4*xy )] in a smoothed 4k x 3k image.
How well this works depends on ...
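
A minimal sketch of that idea, assuming bilinear interpolation as the one-time smoothing step (the answer does not say which filter to use) and a row-major float image. The `Image` struct and the functions `upsampleBilinear` and `lookup` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Image {
    int width;
    int height;
    std::vector<float> data;  // row-major, width * height values
};

// Built once: the source scaled by an integer factor with bilinear
// interpolation (an assumed choice of smoothing filter).
Image upsampleBilinear(const Image& src, int factor) {
    Image dst{src.width * factor, src.height * factor, {}};
    dst.data.resize(static_cast<std::size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y) {
        const float sy = static_cast<float>(y) / factor;
        const int y0 = static_cast<int>(sy);
        const int y1 = std::min(y0 + 1, src.height - 1);
        const float fy = sy - y0;
        for (int x = 0; x < dst.width; ++x) {
            const float sx = static_cast<float>(x) / factor;
            const int x0 = static_cast<int>(sx);
            const int x1 = std::min(x0 + 1, src.width - 1);
            const float fx = sx - x0;
            const float top = src.data[y0 * src.width + x0] * (1.0f - fx) +
                              src.data[y0 * src.width + x1] * fx;
            const float bot = src.data[y1 * src.width + x0] * (1.0f - fx) +
                              src.data[y1 * src.width + x1] * fx;
            dst.data[y * dst.width + x] = top * (1.0f - fy) + bot * fy;
        }
    }
    return dst;
}

// Per query: a truncating index into the upsampled image,
// i.e. image2[int(2 * xy)] when factor is 2. Indices are clamped to the image.
inline float lookup(const Image& up, int factor, float x, float y) {
    const int ix = std::min(std::max(static_cast<int>(factor * x), 0), up.width - 1);
    const int iy = std::min(std::max(static_cast<int>(factor * y), 0), up.height - 1);
    return up.data[iy * up.width + ix];
}
```

This only pays off if the same source image is sampled many times, since building the upsampled copy costs more than a handful of direct bicubic samples.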

白色秋天 2024-10-21 16:12:12

I think GPU is the way to go. It's probably the most natural task for this type of hardware. I would start by looking into CUDA or OpenCL. Older techniques like simple DirectX/OpenGL pixel/fragment shaders should work just fine as well.

Some links I found, maybe they could help you:
