High-performance multi-ROI image color averaging on iOS



CoreImage's CIAreaAverage filter can easily be used to perform whole CIImage RGB color averaging. For example:

import CoreImage

// A CIContext with color management disabled, so the averages match the raw pixel values.
let options = [CIContextOption.workingColorSpace: kCFNull as Any]
let context = CIContext(options: options)

let parameters: [String: Any] = [
    kCIInputImageKey: inputImage, // assume this exists
    kCIInputExtentKey: CIVector(cgRect: inputImage.extent)
]

let filter = CIFilter(name: "CIAreaAverage", parameters: parameters)!

// CIAreaAverage produces a 1x1 output image; render it into a 4-float RGBA bitmap.
var bitmap = [Float32](repeating: 0, count: 4)
context.render(filter.outputImage!, toBitmap: &bitmap, rowBytes: 16,
               bounds: CGRect(x: 0, y: 0, width: 1, height: 1),
               format: .RGBAf, colorSpace: nil)

let rAverage = bitmap[0]
let gAverage = bitmap[1]
let bAverage = bitmap[2] // index 3 holds the alpha average
...

However, suppose one does not want whole-CIImage color averaging. Breaking the image up into regions of interest (ROIs) by varying the input extent (see kCIInputExtentKey above) and performing a CIAreaAverage filtering operation per ROI introduces many sequential steps, which drastically decreases performance. The filters cannot be chained, of course, since each output is a single 4-component color average (see bitmap above). Another way of describing this might be "average downsampling".

For example, let's say you have a 1080p image (1920x1080) and you want a 10x10 color-average matrix from it. You would be performing 100 CIAreaAverage operations for 100 different input extents--each corresponding to a 192x108-pixel ROI for which you want the R, G, B, and perhaps A, averages. But this is now 100 sequential CIAreaAverage operations--not performant.
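For reference, a minimal sketch of that naive sequential approach (my own illustration, reusing inputImage and context from the snippet above; the function name and grid parameters are arbitrary):

import CoreImage

// Sketch of the naive per-ROI approach: one CIAreaAverage render pass per cell.
// For the 10x10 example this is 100 sequential renders.
func naiveROIAverages(of inputImage: CIImage,
                      rows: Int, cols: Int,
                      context: CIContext) -> [[Float32]] {
    let roiWidth  = inputImage.extent.width  / CGFloat(cols)
    let roiHeight = inputImage.extent.height / CGFloat(rows)
    var averages: [[Float32]] = []

    for row in 0..<rows {
        for col in 0..<cols {
            let roi = CGRect(x: inputImage.extent.minX + CGFloat(col) * roiWidth,
                             y: inputImage.extent.minY + CGFloat(row) * roiHeight,
                             width: roiWidth, height: roiHeight)
            let filter = CIFilter(name: "CIAreaAverage", parameters: [
                kCIInputImageKey: inputImage,
                kCIInputExtentKey: CIVector(cgRect: roi)
            ])!
            var bitmap = [Float32](repeating: 0, count: 4)
            context.render(filter.outputImage!, toBitmap: &bitmap, rowBytes: 16,
                           bounds: CGRect(x: 0, y: 0, width: 1, height: 1),
                           format: .RGBAf, colorSpace: nil)
            averages.append(bitmap) // [R, G, B, A] for this ROI
        }
    }
    return averages
}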

Perhaps the next thing one might think to do is some sort of parallel for loop, e.g., DispatchQueue.concurrentPerform(iterations:execute:) with one iteration per ROI. However, I am not seeing a performance gain. (Note that CIContext is thread-safe; CIFilter is not.)
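To make that concrete, the concurrent variant meant here looks roughly like this (a sketch only; rois is assumed to be a precomputed [CGRect], and each iteration builds its own CIFilter because only the shared CIContext is thread-safe):

import CoreImage
import Dispatch

// Sketch of the concurrent variant: one concurrentPerform iteration per ROI.
var averages = [[Float32]](repeating: [0, 0, 0, 0], count: rois.count)
averages.withUnsafeMutableBufferPointer { results in
    DispatchQueue.concurrentPerform(iterations: rois.count) { i in
        // CIFilter is not thread-safe, so a fresh instance is created per iteration.
        let filter = CIFilter(name: "CIAreaAverage", parameters: [
            kCIInputImageKey: inputImage,
            kCIInputExtentKey: CIVector(cgRect: rois[i])
        ])!
        var bitmap = [Float32](repeating: 0, count: 4)
        context.render(filter.outputImage!, toBitmap: &bitmap, rowBytes: 16,
                       bounds: CGRect(x: 0, y: 0, width: 1, height: 1),
                       format: .RGBAf, colorSpace: nil)
        results[i] = bitmap // each iteration writes a distinct index
    }
}

Even parallelized this way, each iteration still issues its own render pass, which may explain why no gain shows up.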

Logically, the next idea might be to create a custom CIFilter--let's call it CIMultiAreaAverage. However, it's not obvious how to create a CIKernel that can examine a source pixel's location and map it to a particular destination pixel. You would need some buffer of information, such as a running ROI color sum, or you would need to treat the destination pixel itself as an accumulator. The simplest approach might be to accumulate per-channel ROI sums into an integer-typed destination, and then, once it is rendered to a bitmap, turn each sum into an average by casting to float and dividing by the number of pixels in the ROI.

I wish I had access to the source code for CIAreaAverage. To encapsulate the full functionality in a CIFilter, you might have to go further and write what is really a custom Metal shader. So perhaps someone with some expertise can assist with how to accomplish this with a Metal shader.

Another option might be to use vDSP/vImage to perform these ROI operations. It seems easy to create the necessary vImage_Buffer per ROI, but I'd want to make sure that is done in place (probably, with each buffer just pointing into the original pixel data) for performance. Then, I'm not sure which vDSP mean function to apply to a vImage_Buffer, or how to do so while treating it like an array, if that's possible. It sounds like this might be the most performant approach.
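To make the vDSP idea concrete, here is one possible sketch (my own, under the assumption that the pixel data has already been converted to a planar Float32 buffer per channel): each ROI row is summed in place with vDSP_sve, so no per-ROI copy is needed.

import Accelerate

// Mean of one rectangular ROI inside a planar Float32 channel laid out row-major
// with `rowStride` elements per row. vDSP_sve sums each ROI row without copying.
func roiMean(channel: UnsafePointer<Float>, rowStride: Int,
             roiX: Int, roiY: Int, roiWidth: Int, roiHeight: Int) -> Float {
    var total: Float = 0
    for row in 0..<roiHeight {
        var rowSum: Float = 0
        vDSP_sve(channel + (roiY + row) * rowStride + roiX, 1,
                 &rowSum, vDSP_Length(roiWidth))
        total += rowSum
    }
    return total / Float(roiWidth * roiHeight)
}

The interleaved-to-planar conversion (e.g., vImageConvert_ARGB8888toPlanar8 followed by vImageConvert_Planar8toPlanarF) would be a one-time cost per frame, and the per-ROI work above touches each pixel exactly once.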

What does SO think?

吹泡泡o 2025-01-22 01:06:46


Here is what Apple is doing in CIAreaAverage:

[Figure: filter graph for CIAreaAverage]

I don't know why they follow two different paths, but this is what I think is happening:

The path on the left is a stepwise reduction of the input pixels into a smaller output. The kernel _areaAvg8 reduces a group of (up to) 8x8 pixels into one output pixel by calculating their average value. _areaAvg2 does the same for 2x2 pixels and _horizAvg2 for 2x1. So in multiple steps the image is reduced, each step further reducing the values of the previous step, until the last step produces one final pixel that contains the average of all pixels of the input.

For the right side, I assume that the CIAreaAverageProcessor is a CIImageProcessingKernel that uses Metal Performance Shaders, specifically I assume MPSImageReduceRowMean and MPSImageReduceColumnMean, to do the same. Why they have those two paths with the switch on top I do not know.

For your use case, I suggest you implement something similar to the left path, but stop somewhere in the middle, depending on the size of your desired output.

To improve performance, you can make use of the bilinear sampling that is provided by the graphics hardware basically for free: When you sample the input image at a coordinate in the middle of 4 pixels, you already get an average of these 4 color values. That means for an 8x8 reduction, you only need 4 x 4 = 16 sample operations (instead of 64). This kernel could look something like this:

extern "C" float4 areaAvg8(coreimage::sampler src, coreimage::destination dest) {
    float2 center = dest.coord() * 8.0; // assuming that src is 8x larger than dest
    float4 sum = src.sample(src.transform(center + float2(-3.0, -3.0)))
               + src.sample(src.transform(center + float2(-1.0, -3.0)))
               + src.sample(src.transform(center + float2( 1.0, -3.0)))
               + src.sample(src.transform(center + float2( 3.0, -3.0)))
               + src.sample(src.transform(center + float2(-3.0, -1.0)))
               + src.sample(src.transform(center + float2(-1.0, -1.0)))
               + src.sample(src.transform(center + float2( 1.0, -1.0)))
               + src.sample(src.transform(center + float2( 3.0, -1.0)))
               + src.sample(src.transform(center + float2(-3.0,  1.0)))
               + src.sample(src.transform(center + float2(-1.0,  1.0)))
               + src.sample(src.transform(center + float2( 1.0,  1.0)))
               + src.sample(src.transform(center + float2( 3.0,  1.0)))
               + src.sample(src.transform(center + float2(-3.0,  3.0)))
               + src.sample(src.transform(center + float2(-1.0,  3.0)))
               + src.sample(src.transform(center + float2( 1.0,  3.0)))
               + src.sample(src.transform(center + float2( 3.0,  3.0)));
    return sum / 16.0;
}
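Not part of the original answer, but to sketch how such a kernel might be applied from Swift: the Metal source needs to be compiled into a Core Image metallib (the -fcikernel compiler / -cikernel linker flags), and the ROI callback must report the 8x-larger source region each output tile depends on. The library and function names below are assumptions:

import CoreImage

// Sketch: load the areaAvg8 kernel from a metallib compiled with -fcikernel.
let url = Bundle.main.url(forResource: "default", withExtension: "metallib")!
let data = try! Data(contentsOf: url)
let kernel = try! CIKernel(functionName: "areaAvg8", fromMetalLibraryData: data)

// Reduce every 8x8 block of inputImage to one output pixel.
let outputExtent = CGRect(x: 0, y: 0,
                          width: inputImage.extent.width / 8,
                          height: inputImage.extent.height / 8)
let reduced = kernel.apply(
    extent: outputExtent,
    roiCallback: { _, rect in
        // Each requested output rect depends on the 8x-larger source region.
        CGRect(x: rect.minX * 8, y: rect.minY * 8,
               width: rect.width * 8, height: rect.height * 8)
    },
    arguments: [inputImage]
)

From there, further (smaller) reduction passes could be chained until the output matches the desired grid size, e.g., the 10x10 matrix from the question.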