处理 CUDA 中的边界条件/Halo 区域

发布于 2024-11-02 09:18:52 字数 567 浏览 6 评论 0原文

我正在使用 CUDA 进行图像处理,但我对像素处理有疑问。

应用 mx m 卷积滤波器时,通常会对图像的边界像素做什么?

3 x 3 卷积核中,忽略图像的 1 像素边界更容易处理,特别是当使用共享内存改进代码时。事实上,在这种情况下,不需要检查给定像素是否具有所有可用的邻居(即坐标(0, 0)处的像素没有左、左上、上邻居)。但是,删除原始图像的 1 像素边界可能会生成部分结果。

与此相反,我想处理图像中的所有像素,也在使用共享内存改进时,即加载16 x 16像素,但是计算内部 14 x 14。同样在这种情况下,忽略边界像素会生成更清晰的代码。

在这种情况下通常会做什么?

有人通常使用我的方法忽略边界像素吗?

当然,我知道答案取决于问题的类型,即按像素添加两个图像没有这个问题。

提前致谢。

I'm working on image processing with CUDA and i've a doubt about pixel processing.

What is often done with the boundary pixels of an image when applying a m x m convolution filter?

In a 3 x 3 convolution kernel, ignoring the 1 pixel boundary of the image is easier to deal with, especially when the code is improved with shared memory. Indeed, in this case, one does not need to check if a given pixel has all the neigbourhood available (i.e. pixel at coord (0, 0) has not left, left-upper, upper neighbours). However, removing the 1 pixel boundary of the original image could generate partial results.

Opposite to that, I'd like to process all the pixels within the image, also when using shared memory improvements, i.e., for example, loading 16 x 16 pixels, but computing the inner 14 x 14. Also in this case, ignoring the boundary pixels generates a clearer code.

What is usually done in this case?

Does anyone usually use my approach ignoring the boundary pixels?

Of course, I'm aware the answer depends on the type of problem, i.e. adding two images pixel-wise has not this problem.

Thanks in advance.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

甜是你 2024-11-09 09:18:53

处理边框效果的常见方法是用额外的行和边框填充原始图像。基于您的过滤器大小的列。填充值的一些常见选择是:

  • 常量(例如零)
  • 根据需要多次复制第一行和最后一行/列
  • 在边界处反映图像(例如,列[-1] = 列[1]、列[- 2] = 列[2])
  • 换行图像值(例如,列[-1] = 列[宽度-1],列[-2] = 列[宽度-2])

A common approach to dealing with border effects is to pad the original image with extra rows & columns based on your filter size. Some common choices for the padded values are:

  • A constant (e.g. zero)
  • Replicate the first and last row / column as many times as needed
  • Reflect the image at the borders (e.g. column[-1] = column[1], column[-2] = column[2])
  • Wrap the image values (e.g. column[-1] = column[width-1], column[-2] = column[width-2])
他是夢罘是命 2024-11-09 09:18:53

tl;dr:这取决于您要解决的问题——没有适用于所有问题的解决方案。事实上,从数学上来说,我怀疑可能根本没有“解决方案”,因为我相信这是一个你被迫处理的不适定问题。

(提前为我鲁莽地滥用数学表示歉意)

为了演示,让我们考虑假设所有像素分量和内核值都为正的情况。为了了解其中一些答案如何可能使我们误入歧途,让我们进一步考虑一个简单的平均(“盒子”)过滤器。如果我们将图像边界之外的值设置为零,那么这将明显拉低边界 ceil(n/2)(曼哈顿距离)内每个像素的平均值。因此,您将在过滤后的图像上看到一个“暗”边框(假设单个强度分量或 RGB 色彩空间 - 您的结果将因色彩空间而异!)。请注意,如果我们将边界之外的值设置为任意常数,则可以得出类似的论点 - 平均值将趋向于该常数。如果典型图像的边缘无论如何都趋向于 0,则常量为零可能是合适的。如果我们考虑更复杂的滤波器核(例如高斯),情况也是如此,但问题不会那么明显,因为核值往往会随着距中心的距离而快速减小。

现在假设我们选择重复边缘值,而不是使用常量。这与在图像周围创建边框并复制行、列或角足够多次以确保过滤器保持在新图像“内部”相同。您也可以将其视为夹紧/饱和样本坐标。这对于我们的简单盒式过滤器来说存在问题,因为它过分强调边缘像素的值。一组边缘像素会出现多次,但它们都具有相同的权重w=(1/(n*n))
假设我们对值为 K 的边缘像素进行 3 次采样。这意味着它对平均值的贡献是:

K*w + K*w + K*w  = K*3*w

非常有效,以至于一个像素在平均值中具有更高的权重。请注意,由于这是一个平均滤波器,因此权重在内核上是一个常数。然而,这个论点也适用于权重随位置而变化的内核(再次:想想高斯内核......)。

假设我们环绕或反射采样坐标,以便我们仍然使用图像边界内的值。与使用常量相比,这具有一些有价值的优点,但也不一定是“正确的”。例如,您拍摄了多少张上边框物体与下边框物体相似的照片?除非你拍摄镜面光滑的湖泊的照片,否则我怀疑这是真的。如果您正在拍摄岩石的照片以用作游戏中的纹理,则环绕或反射可能是合适的。我确信这里有一些重要的要点需要说明,关于环绕和反射如何可能减少使用傅立叶变换产生的任何伪影。然而,这又回到了同样的想法:您有一个周期性信号,您不希望通过引入虚假的新频率或高估现有频率的幅度来扭曲该信号。

那么,如果您要过滤蓝天下鲜红色岩石的照片,该怎么办?显然,您不想在蓝色的天空中添加橙色的薄雾,在红色的岩石上添加蓝色的绒毛。反射样本坐标是有效的,因为我们期望与反射坐标处找到的像素颜色相似......除非,只是为了论证,我们想象过滤器内核太大,以至于反射坐标会延伸到地平线之外。

让我们回到盒式过滤器的例子。此过滤器的另一种选择是停止考虑使用静态内核并回想该内核的用途。平均/箱式滤波器被设计为对像素分量求和,然后除以求和的像素数。这个想法是这样可以消除噪音。如果我们愿意以降低边界附近噪声抑制效率为代价,我们可以简单地对更少的像素求和并除以相应更小的数字。这可以扩展到具有类似的“标准化”术语的过滤器——与过滤器的面积或体积相关的术语。对于“面积”项,您计算边界内的内核权重的数量,并忽略不在边界内的那些权重。然后使用这个计数作为“面积”(这可能涉及额外的乘法)。对于体积(再次:假设正权重!),只需将内核权重相加即可。对于导数滤波器来说,这个想法可能很糟糕,因为与噪声像素竞争的像素较少,而且差分对噪声非常敏感。此外,一些过滤器是通过数值优化和/或经验数据而不是从头开始/分析方法导出的,因此可能缺乏明显的“标准化”因子。

tl;dr: It depends on the problem you're trying to solve -- there is no solution for this that applies to all problems. In fact, mathematically speaking, I suspect there may be no "solution" at all since I believe it's an ill-posed problem you're forced to deal with.

(Apologies in advance for my reckless abuse of mathematics)

To demonstrate let's consider a situation where all pixel components and kernel values are assumed to be positive. To get an idea of how some of these answers could lead us astray let's further think about a simple averaging ("box") filter. If we set values outside the boundary of the image to zero then this will clearly drag down the average at every pixel within ceil(n/2) (manhattan distance) of the boundary. So you'll get a "dark" border on your filtered image (assuming a single intensity component or RGB colorspace -- your results will vary by colorspace!). Note that similar arguments can be made if we set the values outside the boundary to any arbitrary constant -- the average will tend towards that constant. A constant of zero might be appropriate if the edges of your typical image tend towards 0 anyway. This is also true if we consider more complex filter kernels like a gaussian however the problem will be less pronounced because the kernel values tend to decrease quickly with distance from the center.

Now suppose that instead of using a constant we choose to repeat the edge values. This is the same as making a border around the image and copying rows, columns, or corners enough times to ensure the filter stays "inside" the new image. You could also think of it as clamping/saturating the sample coordinates. This has problems with our simple box filter because it overemphasizes the values of the edge pixels. A set of edge pixels will appear more than once yet they all receive the same weight w=(1/(n*n)).
Suppose we sample an edge pixel with value K 3 times. That means its contribution to the average is:

K*w + K*w + K*w  = K*3*w

So effectively that one pixel has a higher weight in the average. Note that since this is an average filter the weight is a constant over the kernel. However this argument applies to kernels with weights that vary by position too (again: think of the gaussian kernel..).

Suppose we wrap or reflect the sampling coordinates so that we're still using values from within the boundary of the image. This has some valuable advantages over using a constant but isn't necessarily "correct" either. For instance, how many photos do you take where the objects at the upper border are similar to those at the bottom? Unless you're taking pictures of mirror-smooth lakes I doubt this is true. If you're taking pictures of rocks to use as textures in games wrapping or reflecting could be appropriate. I'm sure there are significant points to be made here about how wrapping and reflecting will likely reduce any artifacts that result from using a fourier transform. However this comes back to the same idea: that you have a periodic signal which you do not wish to distort by introducing spurious new frequencies or overestimating the amplitude of existing frequencies.

So what can you do if you're filtering photos of bright red rocks beneath a blue sky? Clearly you don't want to add orange-ish haze in the blue sky and blue-ish fuzz on the red rocks. Reflecting the sample coordinate works because we expect similar colors to those pixels found at the reflected coordinates... unless, just for the sake of argument, we imagine the filter kernel is so big that the reflected coordinate would extend past the horizon.

Let's go back to the box filter example. An alternative with this filter is to stop thinking about using a static kernel and think back to what this kernel was meant to do. An averaging/box filter is designed to sum the pixel components then divide by the number of pixels summed. The idea is that this smooths out noise. If we're willing to trade a reduced effectiveness in suppressing noise near the boundary we can simply sum fewer pixels and divide by a correspondingly smaller number. This can be extended to filters with similar what-I-will-call-"normalizing" terms -- terms that are related to the area or volume of the filter. For "area" terms you count the number of kernel weights that are within the boundary and ignore those weights that are not. Then use this count as the "area" (which might involve a extra multiplication). For volume (again: assuming positive weights!) simply sum the kernel weights. This idea is probably awful for derivative filters because there are fewer pixels to compete with the noisy pixels and differentials are notoriously sensitive to noise. Also, some filters have been derived by numeric optimization and/or empirical data rather than from ab-initio/analytic methods and thus may lack a readily apparent "normalizing" factor.

傲鸠 2024-11-09 09:18:53

你的问题有点宽泛,我相信它混合了两个问题:

  1. 处理边界条件
  2. 处理光环区域

例如,在计算图像与 3 x 3 内核之间的卷积时,会遇到第一个问题(边界条件)。当卷积窗口跨越边界时,就会出现将图像扩展到其边界之外的问题。

例如,当在共享内存中加载 16 x 16 图块并且必须处理内部 14 x 14 时,会遇到第二个问题(光环区域 平铺来计算二阶导数。

对于第二个问题,我认为一个有用的问题如下:分析内存访问我的 CUDA 内核的合并

关于信号超出其边界的扩展,在这种情况下,由于提供了不同的寻址模式,纹理存储器提供了一个有用的工具,请参阅CUDA 纹理的不同寻址模式

下面,我提供了一个示例,说明如何使用纹理内存通过周期性边界条件来实现中值滤波器。

#include <stdio.h>

#include "TimingGPU.cuh"
#include "Utilities.cuh"

texture<float, 1, cudaReadModeElementType> signal_texture;

#define BLOCKSIZE 32

/*************************************************/
/* KERNEL FUNCTION FOR MEDIAN FILTER CALCULATION */
/*************************************************/
__global__ void median_filter_periodic_boundary(float * __restrict__ d_vec, const unsigned int N){

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        float signal_center = tex1D(signal_texture, tid - 0);
        float signal_before = tex1D(signal_texture, tid - 1);
        float signal_after  = tex1D(signal_texture, tid + 1);

        printf("%i %f %f %f\n", tid, signal_before, signal_center, signal_after);

        d_vec[tid] = (signal_center + signal_before + signal_after) / 3.f;

    }
}


/********/
/* MAIN */
/********/
int main() {

    const int N = 10;

    // --- Input host array declaration and initialization
    float *h_arr = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) h_arr[i] = (float)i;

    // --- Output host and device array vectors
    float *h_vec = (float *)malloc(N * sizeof(float));
    float *d_vec;   gpuErrchk(cudaMalloc(&d_vec, N * sizeof(float)));

    // --- CUDA array declaration and texture memory binding; CUDA array initialization
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    //Alternatively
    //cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

    cudaArray *d_arr;   gpuErrchk(cudaMallocArray(&d_arr, &channelDesc, N, 1));
    gpuErrchk(cudaMemcpyToArray(d_arr, 0, 0, h_arr, N * sizeof(float), cudaMemcpyHostToDevice));

    cudaBindTextureToArray(signal_texture, d_arr); 
    signal_texture.normalized = false; 
    signal_texture.addressMode[0] = cudaAddressModeWrap;

    // --- Kernel execution
    median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_vec, d_vec, N * sizeof(float), cudaMemcpyDeviceToHost));

    for (int i=0; i<N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    printf("Test finished\n");

    return 0;
}

Your question is somewhat broad and I believe it mixes two problems:

  1. dealing with boundary conditions;
  2. dealing with halo regions.

The first problem (boundary conditions) is encountered, for example, when computing the convolution between and image and a 3 x 3 kernel. When the convolution window comes across the boundary, one has the problem of extending the image outside of its boundaries.

The second problem (halo regions) is encountered, for example, when loading a 16 x 16 tile within shared memory and one has to process the internal 14 x 14 tile to compute second order derivatives.

For the second issue, I think a useful question is the following: Analyzing memory access coalescing of my CUDA kernel.

Concerning the extension of a signal outside of its boundaries, a useful tool is provided in this case by texture memory thanks to the different provided addressing modes, see The different addressing modes of CUDA textures.

Below, I'm providing an example on how a median filter can be implemented with periodic boundary conditions using texture memory.

#include <stdio.h>

#include "TimingGPU.cuh"
#include "Utilities.cuh"

texture<float, 1, cudaReadModeElementType> signal_texture;

#define BLOCKSIZE 32

/*************************************************/
/* KERNEL FUNCTION FOR MEDIAN FILTER CALCULATION */
/*************************************************/
__global__ void median_filter_periodic_boundary(float * __restrict__ d_vec, const unsigned int N){

    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < N) {

        float signal_center = tex1D(signal_texture, tid - 0);
        float signal_before = tex1D(signal_texture, tid - 1);
        float signal_after  = tex1D(signal_texture, tid + 1);

        printf("%i %f %f %f\n", tid, signal_before, signal_center, signal_after);

        d_vec[tid] = (signal_center + signal_before + signal_after) / 3.f;

    }
}


/********/
/* MAIN */
/********/
int main() {

    const int N = 10;

    // --- Input host array declaration and initialization
    float *h_arr = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) h_arr[i] = (float)i;

    // --- Output host and device array vectors
    float *h_vec = (float *)malloc(N * sizeof(float));
    float *d_vec;   gpuErrchk(cudaMalloc(&d_vec, N * sizeof(float)));

    // --- CUDA array declaration and texture memory binding; CUDA array initialization
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    //Alternatively
    //cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

    cudaArray *d_arr;   gpuErrchk(cudaMallocArray(&d_arr, &channelDesc, N, 1));
    gpuErrchk(cudaMemcpyToArray(d_arr, 0, 0, h_arr, N * sizeof(float), cudaMemcpyHostToDevice));

    cudaBindTextureToArray(signal_texture, d_arr); 
    signal_texture.normalized = false; 
    signal_texture.addressMode[0] = cudaAddressModeWrap;

    // --- Kernel execution
    median_filter_periodic_boundary<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_vec, N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(h_vec, d_vec, N * sizeof(float), cudaMemcpyDeviceToHost));

    for (int i=0; i<N; i++) printf("h_vec[%i] = %f\n", i, h_vec[i]);

    printf("Test finished\n");

    return 0;
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文