I found Quasimondo : Incubator : Processing : Fast Gaussian Blur. This method contains a lot of approximations, like using integers and look-up tables instead of floats and floating-point divisions. I don't know how much speedup that gives in modern Java code.
Fast Shadows on Rectangles has an approximating algorithm using B-splines.
Fast Gaussian Blur Algorithm in C# claims to have some cool optimizations.
Also, Fast Gaussian Blur (PDF) by David Everly has a fast method for Gaussian blur processing.
I would try out the various methods, benchmark them and post the results here.
For my purposes, I have copied and implemented the basic (process X-Y axes independently) method and David Everly's Fast Gaussian Blur method from the Internet. They differ in parameters, so I couldn't compare them directly. However, the latter goes through far fewer iterations for a large blur radius. Also, the latter is an approximate algorithm.
Answering this old question with new libraries that are now available (as of 2016), because there have been many advancements in GPU technology for Java.
As suggested in a few other answers, CUDA is an alternative. But Java has CUDA support now.
IBM CUDA4J library: provides a Java API for managing and accessing GPU devices, libraries, kernels, and memory. Using these new APIs it is possible to write Java programs that manage GPU device characteristics and offload work to the GPU with the convenience of the Java memory model, exceptions, and automatic resource management.
Jcuda: Java bindings for NVIDIA CUDA and related libraries. With JCuda it is possible to interact with the CUDA runtime and driver API from Java programs.
Aparapi: allows Java developers to take advantage of the compute power of GPU and APU devices by executing data parallel code fragments on the GPU rather than being confined to the local CPU.
Some Java OpenCL binding libraries
https://github.com/ochafik/JavaCL : Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings
http://jogamp.org/jocl/www/ : Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings
http://www.lwjgl.org/ : Java bindings for OpenCL: Auto-generated low-level bindings and object-oriented convenience classes
http://jocl.org/ : Java bindings for OpenCL: Low-level bindings that are a 1:1 mapping of the original OpenCL API
All of the above libraries can help implement Gaussian blur faster than a pure-CPU implementation in Java.
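To make the offloading concrete, here is a minimal sketch of a naive horizontal box-blur pass written as an Aparapi kernel; the com.aparapi package names are assumed (older releases use com.amd.aparapi), the image is treated as a single grey channel, and the class and field names are mine, so treat it as a starting point rather than a tested kernel.
import com.aparapi.Kernel;
import com.aparapi.Range;

public class BoxBlurRowKernel extends Kernel {
    final int[] src, dst;
    final int width, height, radius;

    BoxBlurRowKernel(int[] src, int[] dst, int width, int height, int radius) {
        this.src = src; this.dst = dst;
        this.width = width; this.height = height; this.radius = radius;
    }

    @Override
    public void run() {
        int idx = getGlobalId();              // one work item per pixel
        int x = idx % width;
        int y = idx / width;
        int sum = 0, count = 0;
        for (int dx = -radius; dx <= radius; dx++) {
            int xx = x + dx;
            if (xx >= 0 && xx < width) {      // skip samples outside the row
                sum += src[y * width + xx] & 0xFF;
                count++;
            }
        }
        dst[idx] = sum / count;               // grey value, not a packed ARGB pixel
    }
}

// Possible usage: new BoxBlurRowKernel(src, dst, w, h, r).execute(Range.create(w * h));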
For larger blur radiuses, try applying a box blur three times. This will approximate a Gaussian blur very well, and be much faster than a true Gaussian blur.
I have converted Ivan Kuckir's implementation of a fast Gaussian blur, which uses three passes with linear-time box blurs, to Java. The resulting process is O(n), as he states on his own blog. If you would like to learn more about why three box-blur passes approximate a Gaussian blur (to within about 3%), you may check out box blur and Gaussian blur.
Here is the Java implementation.
// Requires: import java.awt.image.BufferedImage; import java.util.ArrayList;
@Override
public BufferedImage ProcessImage(BufferedImage image) {
    int width = image.getWidth();
    int height = image.getHeight();
    int[] pixels = image.getRGB(0, 0, width, height, null, 0, width);
    int[] changedPixels = new int[pixels.length];
    FastGaussianBlur(pixels, changedPixels, width, height, 12);   // hard-coded blur strength; passed to CreateGaussianBoxes as sigma
    BufferedImage newImage = new BufferedImage(width, height, image.getType());
    newImage.setRGB(0, 0, width, height, changedPixels, 0, width);
    return newImage;
}

// Three successive box blurs with suitably chosen sizes approximate a Gaussian blur.
private void FastGaussianBlur(int[] source, int[] output, int width, int height, int radius) {
    ArrayList<Integer> gaussianBoxes = CreateGaussianBoxes(radius, 3);
    BoxBlur(source, output, width, height, (gaussianBoxes.get(0) - 1) / 2);
    BoxBlur(output, source, width, height, (gaussianBoxes.get(1) - 1) / 2);
    BoxBlur(source, output, width, height, (gaussianBoxes.get(2) - 1) / 2);
}

// Computes the n (odd) box widths whose repeated application best matches a Gaussian of the given sigma.
private ArrayList<Integer> CreateGaussianBoxes(double sigma, int n) {
    double idealFilterWidth = Math.sqrt((12 * sigma * sigma / n) + 1);
    int filterWidth = (int) Math.floor(idealFilterWidth);
    if (filterWidth % 2 == 0) {
        filterWidth--;
    }
    int filterWidthU = filterWidth + 2;
    double mIdeal = (12 * sigma * sigma - n * filterWidth * filterWidth - 4 * n * filterWidth - 3 * n) / (-4 * filterWidth - 4);
    double m = Math.round(mIdeal);
    ArrayList<Integer> result = new ArrayList<>();
    for (int i = 0; i < n; i++) {
        result.add(i < m ? filterWidth : filterWidthU);
    }
    return result;
}

// One box-blur pass: a horizontal pass followed by a vertical pass, each O(1) per pixel.
private void BoxBlur(int[] source, int[] output, int width, int height, int radius) {
    System.arraycopy(source, 0, output, 0, source.length);
    BoxBlurHorizontal(output, source, width, height, radius);
    BoxBlurVertical(source, output, width, height, radius);
}

// Sliding-window box blur along each row. Only the lowest byte of each pixel is read, and the result
// is written back as an opaque grey ARGB value, so the blur is effectively single-channel.
private void BoxBlurHorizontal(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
    int resultingColorPixel;
    float iarr = 1f / (radius + radius + 1);
    for (int i = 0; i < height; i++) {
        int outputIndex = i * width;
        int li = outputIndex;
        int sourceIndex = outputIndex + radius;
        int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);              // first pixel of the row
        int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width - 1]);  // last pixel of the row
        float val = (radius + 1) * fv;
        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) (sourcePixels[outputIndex + j]));
        }
        for (int j = 0; j <= radius; j++) {                                         // left edge: clamp to the first pixel
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - fv;
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }
        for (int j = (radius + 1); j < (width - radius); j++) {                     // interior: slide the window
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex++]) - Byte.toUnsignedInt((byte) sourcePixels[li++]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }
        for (int j = (width - radius); j < width; j++) {                            // right edge: clamp to the last pixel
            val += lv - Byte.toUnsignedInt((byte) sourcePixels[li++]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex++] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
        }
    }
}

// Sliding-window box blur along each column, mirroring the horizontal pass.
private void BoxBlurVertical(int[] sourcePixels, int[] outputPixels, int width, int height, int radius) {
    int resultingColorPixel;
    float iarr = 1f / (radius + radius + 1);
    for (int i = 0; i < width; i++) {
        int outputIndex = i;
        int li = outputIndex;
        int sourceIndex = outputIndex + radius * width;
        int fv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex]);                         // first pixel of the column
        int lv = Byte.toUnsignedInt((byte) sourcePixels[outputIndex + width * (height - 1)]);  // last pixel of the column
        float val = (radius + 1) * fv;
        for (int j = 0; j < radius; j++) {
            val += Byte.toUnsignedInt((byte) sourcePixels[outputIndex + j * width]);
        }
        for (int j = 0; j <= radius; j++) {                                                    // top edge
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - fv;
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            sourceIndex += width;
            outputIndex += width;
        }
        for (int j = radius + 1; j < (height - radius); j++) {                                 // interior
            val += Byte.toUnsignedInt((byte) sourcePixels[sourceIndex]) - Byte.toUnsignedInt((byte) sourcePixels[li]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            li += width;
            sourceIndex += width;
            outputIndex += width;
        }
        for (int j = (height - radius); j < height; j++) {                                     // bottom edge
            val += lv - Byte.toUnsignedInt((byte) sourcePixels[li]);
            resultingColorPixel = Byte.toUnsignedInt(((Integer) Math.round(val * iarr)).byteValue());
            outputPixels[outputIndex] = (0xFF << 24) | (resultingColorPixel << 16) | (resultingColorPixel << 8) | (resultingColorPixel);
            li += width;
            outputIndex += width;
        }
    }
}
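For completeness, a small usage sketch of the code above; the class name MyFastBlur is a placeholder for whatever class holds these methods, and it assumes ImageIO returns an image whose getType() is a standard RGB type (ProcessImage reuses it). Note that the implementation above reads only the lowest byte of each pixel and writes a grey result.
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class BlurDemo {
    public static void main(String[] args) throws Exception {
        BufferedImage input = ImageIO.read(new File("input.png"));
        BufferedImage blurred = new MyFastBlur().ProcessImage(input);   // MyFastBlur: placeholder class name
        ImageIO.write(blurred, "png", new File("blurred.png"));
    }
}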
I would consider using CUDA or some other GPU programming toolkit for this, especially if you want to use a larger kernel. Failing that, there's always hand tweaking your loops in assembly.
I have seen several answers in different places and am collecting them here so that I can try to wrap my mind around them and remember them for later:
Regardless of which approach you use, filter the horizontal and vertical dimensions separately with 1D filters rather than using a single square filter.
Hierarchical pyramid of images with reduced resolution, as in SIFT
Repeated box blurs, motivated by the Central Limit Theorem. The box blur is central to Viola and Jones's face detection, where they call it an integral image if I remember correctly (a summed-area-table sketch follows this list). I think Haar-like features use something similar, too.
Stack Blur: a queue-based alternative somewhere between the convolutional and box blur approaches
Bessel filters, although there is some debate about these
After reviewing all of these, I'm reminded that simple, poor approximations often work well in practice. In a different field, Alex Krizhevsky found ReLUs to be faster than the classic sigmoid function in his ground-breaking AlexNet, even though they appear at first glance to be a terrible approximation to the sigmoid.
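Since the integral image (summed-area table) is mentioned in the list above, here is a minimal single-channel sketch of how it turns a box blur into four lookups per pixel; the method names are mine and border clamping is omitted for brevity.
// Summed-area table: sat[y][x] holds the sum of src over the rectangle from (0,0) to (x-1,y-1).
static long[][] summedAreaTable(int[][] src, int w, int h) {
    long[][] sat = new long[h + 1][w + 1];
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            sat[y + 1][x + 1] = src[y][x] + sat[y][x + 1] + sat[y + 1][x] - sat[y][x];
        }
    }
    return sat;
}

// Box-blurred value at (x, y): the (2r+1)x(2r+1) window sum comes from just four table entries.
// Assumes the window lies inside the image; clamp x0, y0, x1, y1 for a production version.
static int boxBlurAt(long[][] sat, int x, int y, int r) {
    int x0 = x - r, y0 = y - r, x1 = x + r + 1, y1 = y + r + 1;
    long sum = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0];
    int side = 2 * r + 1;
    return (int) (sum / (side * side));
}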
I struggled with this problem for my research and tried an interesting method for a fast Gaussian blur. First, as mentioned, it is best to separate the blur into two 1D blurs, but depending on your hardware for the actual calculation of the pixel values, you can actually pre-compute all possible values and store them in a look-up table.
In other words, pre-calculate every combination of Gaussian coefficient * input pixel value. Of course you will need to discretize your coefficients, but I just wanted to add this solution. If you have an IEEE subscription, you can read more in Fast image blurring using Lookup Table for real time feature extraction. Ultimately, I ended up using CUDA though :)
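As a rough illustration of the idea (not the exact scheme from that paper), the sketch below pre-multiplies each quantized kernel coefficient by every possible 8-bit value, so the inner convolution loop becomes table lookups and additions only; the kernel construction and edge handling are assumed to exist elsewhere.
// kernel: 1D Gaussian weights quantized to integers that sum to 'scale' (e.g. 256).
// lut[k][v] = kernel[k] * v, pre-computed for every 8-bit value v.
static int[][] buildLut(int[] kernel) {
    int[][] lut = new int[kernel.length][256];
    for (int k = 0; k < kernel.length; k++) {
        for (int v = 0; v < 256; v++) {
            lut[k][v] = kernel[k] * v;
        }
    }
    return lut;
}

// One horizontal pass over a single channel; interior pixels only, for brevity.
static void convolveRow(int[] row, int[] out, int[][] lut, int radius, int scale) {
    for (int x = radius; x < row.length - radius; x++) {
        int acc = 0;
        for (int k = -radius; k <= radius; k++) {
            acc += lut[k + radius][row[x + k] & 0xFF];   // lookup replaces the multiply
        }
        out[x] = acc / scale;
    }
}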
In 1D:
Blurring repeatedly with almost any kernel will tend toward a Gaussian kernel. This is what's so cool about the Gaussian distribution, and is why statisticians like it. So choose something that's easy to blur with and apply it several times.
For example, it's easy to blur with a box shaped kernel. First calculate a cumulative sum:
y(i) = y(i-1) + x(i)
then:
blurred(i) = y(i+radius) - y(i-radius)
Repeat several times.
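A minimal 1D sketch of that cumulative-sum trick (single channel; the prefix array here has a leading zero, which is why the indexing differs slightly from the formula above, and the division normalizes the window sum; the ends of the array are simply left untouched to keep it short):
// One box-blur pass via a running sum: y[i+1] = y[i] + x[i],
// then blurred(i) = (y[i+radius+1] - y[i-radius]) / (2*radius + 1).
static void boxBlur1D(float[] x, float[] out, int radius) {
    float[] y = new float[x.length + 1];
    for (int i = 0; i < x.length; i++) {
        y[i + 1] = y[i] + x[i];                            // cumulative sum
    }
    for (int i = radius; i < x.length - radius; i++) {
        out[i] = (y[i + radius + 1] - y[i - radius]) / (2 * radius + 1);
    }
}
// Repeating this pass (e.g. three times per dimension) moves the result toward a Gaussian.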
Or you might go back and forth a few times with some variety of IIR filter; these are similarly fast.
In 2D or higher:
Blur in each dimension one after the other, as DarenW said.
You should use the fact that a Gaussian kernel is separable, i.e. you can express a 2D convolution as a combination of two 1D convolutions.
If the filter is large, it may also make sense to use the fact that convolution in the spatial domain is equivalent to multiplication in the frequency (Fourier) domain. This means that you can take the Fourier transform of the image and the filter, multiply the (complex) results, and then take the inverse Fourier transform. The complexity of the FFT (Fast Fourier Transform) is O(n log n), while the complexity of a convolution is O(n^2). Also, if you need to blur many images with the same filter, you would only need to take the FFT of the filter once.
If you decide to go with using an FFT, the FFTW library is a good choice.
Here are 3 solutions, from worst to best in terms of performance. This is inspired by this page.
A naïve implementation is to loop over each pixel and then calculate the weighted average of the adjacent pixels. This is very slow, even if a pre-calculated weight table is used.
A faster approach is to use a box blur. It's similar to a Gaussian but the weight is constant. If multiple passes are applied (e.g. 3 passes), it gives a very good approximation of a Gaussian blur.
for (y = 0 to h-1) {
    for (x = 0 to w-1) {
        sum = 0;
        for (j = -radius to radius) { //box blur
            for (i = -radius to radius) {
                sum += input[x + i, y + j]
            }
        }
        output[x, y] = sum / ((radius * 2 + 1) * (radius * 2 + 1))
    }
}
Note: this is pseudocode. There is no bounds checking, but it's fairly easy to implement.
Complexity : O(n radius^2)
As someone stated in the comments of a previous answer, the 2D Gaussian can be decomposed into two 1D ones (one horizontal and one vertical):
for (y = 0 to h-1) {
    for (x = 0 to w-1) { //horizontal
        sum = 0;
        for (r = -radius to radius) {
            sum += input[x + r, y]
        }
        tmp[x, y] = sum / (radius * 2 + 1)
    }
}
for (x = 0 to w-1) {
    for (y = 0 to h-1) { //vertical
        sum = 0;
        for (r = -radius to radius) {
            sum += tmp[x, y + r]
        }
        output[x, y] = sum / (radius * 2 + 1)
    }
}
Complexity : O(n radius)
If you take a close look at the second algorithm, you will notice that we always sum up the same values over and over:
sum = p0 + p1 + p2 + p3 + p4 + p5 + p6
sum = sum - p0 + p7
sum = sum - p1 + p8
Pseudocode:
for (y = 0 to h-1) {
    sum = 0;
    for (r = -radius to radius) { //initial window, centered on x = 0
        sum += input[r, y]
    }
    for (x = 0 to w-1) { //horizontal
        tmp[x, y] = sum / (radius * 2 + 1)
        sum -= input[x - radius, y]
        sum += input[x + radius + 1, y]
    }
}
for (x = 0 to w-1) {
    sum = 0;
    for (r = -radius to radius) { //initial window, centered on y = 0
        sum += tmp[x, r]
    }
    for (y = 0 to h-1) { //vertical
        output[x, y] = sum / (radius * 2 + 1)
        sum -= tmp[x, y - radius]
        sum += tmp[x, y + radius + 1]
    }
}
Math jocks are likely to know this, but for anyone else...
Due to a nice mathematical property of the Gaussian, you can blur a 2D image quickly by first running a 1D Gaussian blur on each row of the image, then running a 1D blur on each column.
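A small sketch of that row-then-column idea with an actual 1D Gaussian kernel (single channel, edges clamped; truncating the kernel at three sigma is my own choice):
static float[] gaussianKernel(double sigma) {
    int radius = (int) Math.ceil(3 * sigma);               // truncate at 3 sigma
    float[] k = new float[2 * radius + 1];
    float sum = 0;
    for (int i = -radius; i <= radius; i++) {
        k[i + radius] = (float) Math.exp(-(i * i) / (2 * sigma * sigma));
        sum += k[i + radius];
    }
    for (int i = 0; i < k.length; i++) {
        k[i] /= sum;                                       // normalize so the weights sum to 1
    }
    return k;
}

// Blur each row with the 1D kernel, then blur each column of the row-blurred result.
static float[][] gaussianBlur(float[][] img, double sigma) {
    float[] k = gaussianKernel(sigma);
    int r = k.length / 2, h = img.length, w = img[0].length;
    float[][] tmp = new float[h][w];
    float[][] out = new float[h][w];
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float acc = 0;
            for (int i = -r; i <= r; i++) {
                acc += k[i + r] * img[y][Math.min(w - 1, Math.max(0, x + i))];   // clamp at row edges
            }
            tmp[y][x] = acc;
        }
    }
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float acc = 0;
            for (int i = -r; i <= r; i++) {
                acc += k[i + r] * tmp[Math.min(h - 1, Math.max(0, y + i))][x];   // clamp at column edges
            }
            out[y][x] = acc;
        }
    }
    return out;
}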
You probably want the box blur, which is much faster. See this link for a great tutorial and some copy & paste C code.
There are several fast methods for Gaussian blur of 2D data that you should know about.
Your choice depends on the required speed, precision, and implementation complexity.
Dave Hale from CWP has a Mines JTK package, which includes a recursive Gaussian filter (Deriche's method and Van Vliet's method). The Java subroutine can be found at https://github.com/dhale/jtk/blob/0350c23f91256181d415ea7369dbd62855ac4460/core/src/main/java/edu/mines/jtk/dsp/RecursiveGaussianFilter.java
Deriche's method seems to be a very good one for Gaussian blur (and also for Gaussian derivatives).
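If that class matches my memory of it, usage is roughly the sketch below; the RecursiveGaussianFilter(double sigma) constructor and the apply00(float[][], float[][]) method (smoothing along both axes, no derivative) are assumptions that should be checked against the linked source.
import edu.mines.jtk.dsp.RecursiveGaussianFilter;

// Assumed API, to be verified against the linked class.
static float[][] recursiveBlur(float[][] image, double sigma) {
    float[][] smoothed = new float[image.length][image[0].length];
    RecursiveGaussianFilter rgf = new RecursiveGaussianFilter(sigma);
    rgf.apply00(image, smoothed);   // the cost per pixel is independent of sigma
    return smoothed;
}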
Try using Box Blur the way I did here:
Approximating Gaussian Blur Using Extended Box Blur
This is the best approximation.
Using Integral Images you can make it even faster.
If you do, please share your solution.
It is best done on small blocks, as a full-image transpose is slow, while a small-block transpose can be done extremely fast using a chain of PUNPCKs (PUNPCKHBW, PUNPCKHDQ, PUNPCKHWD, PUNPCKLBW, PUNPCKLDQ, PUNPCKLWD).