Fast Gaussian blur on unsigned char images - ARM Neon Intrinsics - iOS Dev

Posted 2025-01-02 19:59:59

Can someone suggest a fast function for applying a Gaussian blur to an image using a 5x5 mask? I need it for iOS app development. I am working directly on the image's memory, which is allocated as

unsigned char *image_sqr_Baseaaddr = (unsigned char *) malloc(noOfPixels);

for (row = 2; row < H-2; row++) 
{
    for (col = 2; col < W-2; col++) 
    {
        newPixel = 0;
        for (rowOffset=-2; rowOffset<=2; rowOffset++)
        {
            for (colOffset=-2; colOffset<=2; colOffset++) 
            {
                rowTotal = row + rowOffset;
                colTotal = col + colOffset;
                iOffset = (unsigned long)(rowTotal*W + colTotal);
                newPixel += (*(imgData + iOffset)) * gaussianMask[2 + rowOffset][2 + colOffset];
            }
        }
        i = (unsigned long)(row*W + col);
        *(imgData + i) = newPixel / 159;
    }
}

This is obviously the slowest function possible. I have heard that ARM NEON intrinsics on iOS can be used to perform several operations in one cycle. Maybe that's the way to go?

The problem is that I am not very familiar with NEON and don't have enough time to learn assembly language at the moment. So it would be great if anyone could post NEON intrinsics code for the problem above, or any other fast implementation in C/C++.

梦与时光遇 2025-01-09 20:00:00

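Before you get into SIMD optimisation with NEON, you should first improve your scalar implementation. The biggest problem with your code at the moment is that it is implemented as if the filter were non-separable, whereas Gaussian kernels are separable. By switching to a separable implementation (a horizontal 1-D pass followed by a vertical 1-D pass) you reduce the number of operations per pixel from N^2 to 2N. For your 5x5 kernel that is a reduction from 25 multiply-adds to 10, i.e. a 2.5x speed-up for very little effort.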
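To make the separable idea concrete, here is a minimal scalar sketch in C. It is only an illustration under some assumptions: the question does not show gaussianMask, so the sketch uses the binomial kernel [1 4 6 4 1] in each direction (total weight 256) rather than the asker's mask normalised by 159, which means the output will differ slightly from the original code. The names blur5x5_separable, k5 and tmp are made up for this example. It also writes to a separate output buffer, because filtering in place would make later pixels read neighbours that have already been blurred.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed 1-D binomial kernel; its outer product with itself is a 5x5
   Gaussian approximation whose weights sum to 16 * 16 = 256. */
static const unsigned k5[5] = {1, 4, 6, 4, 1};

/* Blurs the interior of a W x H 8-bit image from src into dst (separate
   buffers). The 2-pixel border is copied unchanged, matching the loop
   bounds in the question. Returns 0 on success, -1 on allocation failure. */
int blur5x5_separable(const unsigned char *src, unsigned char *dst, int W, int H)
{
    memcpy(dst, src, (size_t)W * H);
    if (W < 5 || H < 5)
        return 0;                       /* no interior pixels to blur */

    uint16_t *tmp = malloc((size_t)W * H * sizeof *tmp);
    if (!tmp)
        return -1;

    /* Pass 1: horizontal 1x5 filter, 5 multiply-adds per pixel. */
    for (int row = 0; row < H; row++) {
        const unsigned char *in = src + (size_t)row * W;
        uint16_t *out = tmp + (size_t)row * W;
        for (int col = 2; col < W - 2; col++) {
            unsigned sum = 0;
            for (int k = -2; k <= 2; k++)
                sum += k5[k + 2] * in[col + k];
            out[col] = (uint16_t)sum;   /* at most 255 * 16 = 4080 */
        }
    }

    /* Pass 2: vertical 5x1 filter on the row sums, 5 more multiply-adds. */
    for (int row = 2; row < H - 2; row++) {
        for (int col = 2; col < W - 2; col++) {
            unsigned sum = 0;
            for (int k = -2; k <= 2; k++)
                sum += k5[k + 2] * tmp[(size_t)(row + k) * W + col];
            /* Divide by 256 with rounding to nearest. */
            dst[(size_t)row * W + col] = (unsigned char)((sum + 128) >> 8);
        }
    }

    free(tmp);
    return 0;
}
```

Choosing a kernel whose total weight is a power of two also turns the final division into a shift, which is one more scalar optimisation that should carry over to a vectorised version.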

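You may find that a fully optimised scalar implementation meets your needs without having to resort to SIMD. If not, you can at least carry these scalar optimisations over to a vectorised implementation.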

http://en.wikipedia.org/wiki/Gaussian_blur

http://blogs.mathworks.com/steve/2006/11/28/separable-convolution-part-2/