I have used CUDA for several image processing algorithms. These applications, of course, are very well suited for CUDA (or any GPU processing paradigm).
IMO, there are three typical stages when porting an algorithm to CUDA:
Initial Porting: Even with a very basic knowledge of CUDA, you can port simple algorithms within a few hours. If you are lucky, you gain a factor of 2 to 10 in performance.
Trivial Optimizations: This includes using textures for input data and padding of multi-dimensional arrays. If you are experienced, this can be done within a day and might give you another factor of 10 in performance. The resulting code is still readable.
Hardcore Optimizations: This includes copying data to shared memory to avoid global memory latency, turning the code inside out to reduce the number of used registers, etc. You can spend several weeks with this step, but the performance gain is not really worth it in most cases. After this step, your code will be so obfuscated that nobody understands it (including you).
This is very similar to optimizing code for CPUs. However, the response of a GPU to performance optimizations is even less predictable than that of a CPU.
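To make the "trivial optimizations" stage a bit more concrete, here is a minimal sketch (mine, not the answerer's code) of padding a 2D image with cudaMallocPitch so that every row starts on an aligned boundary; the image size and the trivial per-pixel operation are made up for the example.

#include <cuda_runtime.h>

// Hypothetical example of the "trivial optimizations" stage: rows are padded by
// cudaMallocPitch so each row starts on an aligned boundary, which helps memory
// coalescing without making the code unreadable.
__global__ void invert(unsigned char* img, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        unsigned char* row = img + y * pitch;  // pitch is the padded row size in bytes
        row[x] = 255 - row[x];                 // some trivial per-pixel work
    }
}

int main()
{
    const int width = 1021, height = 768;      // deliberately not a multiple of 32
    unsigned char* d_img;
    size_t pitch;
    cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(unsigned char), height);
    cudaMemset2D(d_img, pitch, 0, width * sizeof(unsigned char), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    invert<<<grid, block>>>(d_img, pitch, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_img);
    return 0;
}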
Yes. I have implemented the Nonlinear Anisotropic Diffusion Filter using the CUDA API.
It was fairly easy, since it's a filter that runs in parallel over an input image. I didn't encounter many difficulties, since it only required a simple kernel. The speedup was about 300x. This was my final CS project. The project can be found here (it's written in Portuguese, though).
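For illustration only (this is a sketch, not the project's actual code), the per-pixel structure of such a filter typically looks like a single explicit Perona-Malik style diffusion step, one thread per pixel; kappa and lambda here are made-up tuning parameters.

// Illustrative sketch: one explicit step of Perona-Malik style anisotropic
// diffusion on a row-major float image, one thread per pixel.
__global__ void diffusion_step(const float* in, float* out, int w, int h,
                               float kappa, float lambda)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    int idx = y * w + x;
    if (x == 0 || y == 0 || x == w - 1 || y == h - 1) {
        out[idx] = in[idx];              // just copy the border pixels
        return;
    }

    float c = in[idx];
    // Differences to the four neighbours.
    float dn = in[(y - 1) * w + x] - c;
    float ds = in[(y + 1) * w + x] - c;
    float de = in[y * w + (x + 1)] - c;
    float dw = in[y * w + (x - 1)] - c;

    // Edge-stopping conductance g(d) = exp(-(d/kappa)^2).
    float gn = expf(-(dn / kappa) * (dn / kappa));
    float gs = expf(-(ds / kappa) * (ds / kappa));
    float ge = expf(-(de / kappa) * (de / kappa));
    float gw = expf(-(dw / kappa) * (dw / kappa));

    out[idx] = c + lambda * (gn * dn + gs * ds + ge * de + gw * dw);
}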
I have tried writing the Mumford-Shah segmentation algorithm too, but that has been a pain to write, since CUDA is still in its early days and lots of strange things happen. I have even seen a performance improvement by adding a

if (false){}

in the code O_O.

The results for this segmentation algorithm weren't good. I had a performance loss of 20x compared to a CPU approach (however, since it is a CPU, a different approach that yielded the same results could have been taken). It's still a work in progress, but unfortunately I left the lab I was working in, so maybe someday I might finish it.
I have been doing GPGPU development with ATI's Stream SDK instead of CUDA. What kind of performance gain you will get depends on a lot of factors, but the most important one is the numeric intensity (that is, the ratio of compute operations to memory references).
A BLAS level-1 or BLAS level-2 function like adding two vectors does only 1 math operation for every 3 memory references, so the NI is (1/3). This will always run slower with CAL or CUDA than just doing it on the CPU. The main reason is the time it takes to transfer the data from the CPU to the GPU and back.
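As a minimal sketch of the vector-add case (my illustration, not part of the original answer): each element costs two loads and one store for a single addition, which is where the 1/3 comes from, and the host-to-device copies are the transfer overhead mentioned above.

// Vector add: 1 floating point operation per 3 memory references, so NI = 1/3.
__global__ void vec_add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];   // 2 loads + 1 store for 1 add
}

// On the host, copying a and b to the device and c back again (cudaMemcpy)
// moves 3*n floats over the bus, which for a kernel this light costs far more
// than the kernel itself.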
For a function like FFT, there are O(N log N) computations and O(N) memory references, so the NI is O(log N). If N is very large, say 1,000,000, it will likely be faster to do it on the GPU; if N is small, say 1,000, it will almost certainly be slower.
For a BLAS level-3 or LAPACK function like LU decomposition of a matrix, or finding its eigenvalues, there are O(N^3) computations and O(N^2) memory references, so the NI is O(N). For very small arrays, say N is a few score, this will still be faster to do on the CPU, but as N increases, the algorithm very quickly goes from memory-bound to compute-bound, and the performance gain on the GPU rises very quickly.
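A rough, illustrative calculation (my numbers, not the answerer's) of how quickly the balance tips for level-3 operations: a dense N x N matrix-matrix multiply does about 2*N^3 floating point operations while touching about 3*N^2 matrix elements, so

    NI ~ 2*N^3 / (3*N^2) = 2*N/3.

For N = 1,000 that is roughly 667 operations per element moved, compared with the fixed 1/3 of the vector add above, so the transfer and memory costs are quickly amortized.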
Anything involving complex arithmetic requires more computation than scalar arithmetic, which usually doubles the NI and increases GPU performance.
(Performance chart omitted; source: earthlink.net)
Here is the performance of CGEMM -- complex single-precision matrix-matrix multiplication done on a Radeon 4870.
I have written trivial applications; it really helps if you can parallelize floating point calculations.
I found the following course, co-taught by a University of Illinois Urbana-Champaign professor and an NVIDIA engineer, very useful when I was getting started: http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/Syllabus.html (includes recordings of all lectures).
I implemented a Genetic Algorithm on the GPU and got speedups of around 7x. More gains are possible with a higher numeric intensity, as someone else pointed out. So yes, the gains are there, if the application is right.
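The usual way to get this kind of speedup (a generic sketch, not the poster's implementation) is to evaluate the fitness of the whole population in parallel, one thread per individual; fitness_of and the gene layout below are placeholders.

// Illustrative only: evaluate the fitness of every individual in parallel.
__device__ float fitness_of(const float* genes, int len)
{
    // Toy fitness: negative sum of squares (maximising drives genes toward 0).
    float s = 0.0f;
    for (int i = 0; i < len; ++i) s -= genes[i] * genes[i];
    return s;
}

__global__ void evaluate_population(const float* population, float* fitness,
                                    int pop_size, int genes_per_individual)
{
    int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind < pop_size)
        fitness[ind] = fitness_of(population + ind * genes_per_individual,
                                  genes_per_individual);
}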
I wrote a complex-valued matrix multiplication kernel that beat the cuBLAS implementation by about 30% for the application I was using it for, and a sort of vector outer product function that ran several orders of magnitude faster than a multiply-trace solution for the rest of the problem.
It was a final-year project. It took me a full year.
http://www.maths.tcd.ie/~oconbhup/Maths_Project.pdf
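For reference, a naive (unoptimised) complex matrix multiplication kernel looks like the sketch below; this is only a baseline for comparison, not the optimised kernel or the cuBLAS call from the project. It uses the cuComplex.h helpers, and each complex multiply-add is several real operations, which ties back to the numeric-intensity point made in another answer.

#include <cuComplex.h>

// Naive baseline: C = A * B for N x N single-precision complex matrices,
// row-major, one thread per output element.
__global__ void cmatmul_naive(const cuFloatComplex* A, const cuFloatComplex* B,
                              cuFloatComplex* C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int k = 0; k < N; ++k) {
        // One complex multiply is 4 real multiplies and 2 adds.
        acc = cuCaddf(acc, cuCmulf(A[row * N + k], B[k * N + col]));
    }
    C[row * N + col] = acc;
}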
I've implemented a Monte Carlo calculation in CUDA for some financial use. The optimised CUDA code is about 500x faster than a "could have tried harder, but not really" multi-threaded CPU implementation (comparing a GeForce 8800GT to a Q6600 here). It is well known that Monte Carlo problems are embarrassingly parallel, though.
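That embarrassingly parallel structure usually looks something like the sketch below (illustrative only, not the optimised financial code): one thread per independent path, each with its own cuRAND state; the geometric-Brownian-motion payoff and all parameters are made up for the example.

#include <curand_kernel.h>

// Each thread simulates one independent price path and records the discounted
// payoff of a European call option.
__global__ void mc_paths(float* payoffs, int n_paths, int n_steps,
                         float S0, float K, float r, float sigma, float dt,
                         unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_paths) return;

    curandState state;
    curand_init(seed, tid, 0, &state);    // independent RNG stream per thread

    float S = S0;
    for (int step = 0; step < n_steps; ++step) {
        float z = curand_normal(&state);  // standard normal draw
        S *= expf((r - 0.5f * sigma * sigma) * dt + sigma * sqrtf(dt) * z);
    }
    payoffs[tid] = expf(-r * dt * n_steps) * fmaxf(S - K, 0.0f);
}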
The major issue encountered was the loss of precision due to the G8x and G9x chips' limitation to IEEE single-precision floating point numbers. With the release of the GT200 chips this can be mitigated to some extent by using the double-precision unit, at the cost of some performance. I haven't tried it out yet.
Also, since CUDA is a C extension, integrating it into another application can be non-trivial.
While I haven't got any practical experience with CUDA yet, I have been studying the subject and found a number of papers documenting positive results with GPGPU APIs (they all include CUDA).
This paper describes how database joins can be parallelized by creating a number of parallel primitives (map, scatter, gather, etc.) that can be combined into an efficient algorithm.
In this paper, a parallel implementation of the AES encryption standard is created with speed comparable to discrete encryption hardware.
Finally, this paper analyses how well CUDA applies to a number of applications such as structured and unstructured grids, combinational logic, dynamic programming and data mining.
I have been using GPGPU for motion detection (originally using Cg and now CUDA) and stabilization (using CUDA) with image processing. I've been getting about a 10-20X speedup in these situations.
From what I've read, this is fairly typical for data-parallel algorithms.
I have implemented Cholesky factorization for solving large linear equations on the GPU using the ATI Stream SDK. My observations were:
Got a performance speedup of up to 10 times.
Working on the same problem to optimize it further, by scaling it to multiple GPUs.