How to manage large 2D FFTs in CUDA
I have successfully written some CUDA FFT code that does a 2D convolution of an image, as well as some other calculations.
How do I go about figuring out the largest FFTs I can run? It seems that a plan for a 2D R2C convolution takes 2x the image size, and the C2R takes another 2x the image size. This seems like a lot of overhead!
Also, it seems like most of the benchmarks are for relatively small FFTs. Why is this? It seems that for large images I am going to run out of memory quickly. How is this typically handled? Can you perform an FFT convolution on a tile of an image, combine those results, and expect the outcome to be the same as if I had run a 2D FFT on the entire image?
Thanks for answering these questions
Comments (2)
CUFFT plans a different algorithm depending on your image size. If the transform does not fit in shared memory and its dimensions are not powers of 2, CUFFT plans an out-of-place transform, whereas smaller images with suitable dimensions are handled more efficiently.
If you're set on FFTing the whole image and need to see what your GPU can handle, my best answer would be to guess and check with different image sizes, as the CUFFT planning is complicated.
See the documentation: http://developer.download.nvidia.com/compute/cuda/1_1/CUFFT_Library_1.1.pdf
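Before guessing and checking, it can help to do the buffer arithmetic by hand. The sketch below is only a rough budget, written in Python for brevity. It assumes single precision, separate input and output buffers, and a hypothetical layout of two real arrays (image and kernel) plus one half-spectrum for each. The R2C output size of nx * (ny / 2 + 1) complex values follows the CUFFT data layout. The work area that the plan itself allocates is not modelled, so the `headroom` parameter is a placeholder assumption, not a measured value. All function names are made up for illustration.

```python
# Rough device-memory budget for a single-precision 2D FFT convolution.
# Assumption: out-of-place buffers; the CUFFT plan's work area is NOT included.

FLOAT_BYTES = 4      # one cufftReal
COMPLEX_BYTES = 8    # one cufftComplex (two floats)


def r2c_buffer_bytes(nx, ny):
    """Bytes for the real input and the half-spectrum output of an nx-by-ny R2C FFT."""
    real_bytes = nx * ny * FLOAT_BYTES
    # R2C stores only the non-redundant half of the spectrum.
    spectrum_bytes = nx * (ny // 2 + 1) * COMPLEX_BYTES
    return real_bytes, spectrum_bytes


def convolution_data_bytes(nx, ny):
    """Image and kernel as real arrays, plus one spectrum each (hypothetical layout)."""
    real_bytes, spectrum_bytes = r2c_buffer_bytes(nx, ny)
    return 2 * real_bytes + 2 * spectrum_bytes


def largest_square_size(free_bytes, headroom=0.5):
    """Largest power-of-two n whose n-by-n data buffers fit in headroom * free_bytes.

    headroom is an assumed safety margin left for the plan's work area.
    """
    budget = free_bytes * headroom
    n = 1
    while convolution_data_bytes(2 * n, 2 * n) <= budget:
        n *= 2
    return n


if __name__ == "__main__":
    real_b, spec_b = r2c_buffer_bytes(4096, 4096)
    print(real_b / 2**20, "MiB real,", spec_b / 2**20, "MiB spectrum")
    print("largest square for 2 GiB free:", largest_square_size(2 * 2**30))
```

On a real device, the free-memory figure would come from `cudaMemGetInfo`, and the work area size for a given plan can be queried with `cufftEstimate2d` or `cufftGetSize2d` instead of being guessed.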
I agree with Mark that tiling the image is the way to go for convolution. Since convolution amounts to computing many independent integrals, you can simply decompose the domain into its constituent parts, compute those independently, and stitch them back together. The FFT convolution trick simply reduces the complexity of the integrals you need to compute.
I expect that your GPU code should outperform MATLAB by a large factor in all situations, unless you do something weird.
It's not usually practical to run an FFT on an entire image. Not only does it take a lot of memory, but FFTs run fastest when the width and height are powers of 2 (or at least products of small primes), which often means padding your input to an awkward size.
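Cutting the image into tiles is perfectly reasonable. The size of the tiles will determine the frequency resolution you're able to achieve. You will also want to overlap the tiles: for convolution, each tile needs a border of at least the kernel size minus one in each dimension for the stitched result to match the full-image convolution.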
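To see that tiled FFT convolution can reproduce the whole-signal result, here is a minimal overlap-add sketch. It is 1D and pure Python to stay short; a 2D image works the same way along each axis. The function names, the tile size, and the tiny radix-2 FFT are all illustrative assumptions, not anything from CUFFT. Each tile is zero-padded so its full linear convolution fits, and the tails that spill past the tile edge are added into the neighbouring output.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def fft_convolve(x, h):
    """Full linear convolution of x and h via zero-padded FFTs."""
    out_len = len(x) + len(h) - 1
    n = 1
    while n < out_len:          # pad to a power of two that holds the full result
        n *= 2
    X = fft(list(x) + [0.0] * (n - len(x)))
    H = fft(list(h) + [0.0] * (n - len(h)))
    y = fft([a * b for a, b in zip(X, H)], invert=True)
    return [v.real / n for v in y[:out_len]]   # unnormalised inverse, so divide by n


def overlap_add(x, h, tile):
    """Convolve x with h one tile at a time and add the overlapping tails."""
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), tile):
        piece = fft_convolve(x[start:start + tile], h)
        for i, v in enumerate(piece):
            y[start + i] += v   # tails of length len(h) - 1 overlap the next tile
    return y


def direct_convolve(x, h):
    """Reference O(N*M) convolution used only to check the FFT versions."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y


if __name__ == "__main__":
    signal = [float((7 * i) % 11) for i in range(50)]
    kernel = [0.25, 0.5, 0.25]
    whole = fft_convolve(signal, kernel)
    tiled = overlap_add(signal, kernel, tile=16)
    print("max difference:", max(abs(a - b) for a, b in zip(whole, tiled)))
```

The two results agree up to floating-point rounding. Without the zero-padding and the added tails, each tile's FFT would wrap around (circular convolution) and the seams would be wrong.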