Efficient 2D FFT on real input data?

Posted 2024-09-28 03:00:40

I'm currently implementing a two-dimensional FFT for real input data using OpenCL (more specifically, a fast 2D convolution using FFTs, so I only need something that behaves similarly enough to apply the convolution to). The 2D FFT is implemented using a 1D FFT on the rows and afterwards a 1D FFT on the columns.

To make this more efficient, I'm trying to use the symmetries of FFTs with real input so that I can calculate smaller FFTs. I found that I can combine two rows into one, using the first as the real component and the second as the imaginary component, do the first 1D FFT on the resulting row, and then use the symmetry properties to construct the results of the 1D FFTs of the individual rows from that. So what I'm doing is basically the following:

Let f and g be rows from the matrix.

  1. Construct x = f + i * g
  2. Transform to get F(x) = F(f) + i * F(g)
  3. Use symmetries to extract F(f) and F(g) from F(x)
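
The three steps above can be sketched in a few lines. This is only an illustration in plain Python, not OpenCL: `dft` is a naive O(n^2) transform standing in for the real FFT kernel, and `split_two_real` is a hypothetical helper name. The extraction relies on the spectrum of a real signal being Hermitian, so F(f)[k] = (X[k] + conj(X[n-k])) / 2 and F(g)[k] = (X[k] - conj(X[n-k])) / (2i).

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for a real FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def split_two_real(X):
    """Recover F(f) and F(g) from X = F(f + i*g), with f and g real."""
    n = len(X)
    Ff, Fg = [], []
    for k in range(n):
        Xc = X[(n - k) % n].conjugate()   # conj(X[-k]), index taken modulo n
        Ff.append((X[k] + Xc) / 2)        # Hermitian part      -> F(f)
        Fg.append((X[k] - Xc) / 2j)       # anti-Hermitian part -> F(g)
    return Ff, Fg

# Two example rows (arbitrary values chosen for illustration).
f = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
g = [0.5, -1.0, 2.0, 2.0, 0.0, 1.0, 1.5, -3.0]

X = dft([complex(a, b) for a, b in zip(f, g)])   # steps 1 and 2
Ff, Fg = split_two_real(X)                       # step 3
```

Only bins 0 to n/2 of `Ff` and `Fg` carry independent information; the remaining bins are their complex conjugates.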

However, I can't just feed the results directly into the second 1D FFT, because in that case I would not transform the whole matrix, but two submatrices instead. Extracting the data between the transformations, however, means either storing more data (n/2+1 entries are needed to express the result of a 1D FFT on real input) or combining the elements at index 0 and index n/2 into one element (using the same trick, since both numbers are guaranteed to be real). The latter uses the same amount of storage, but I would have to make a special case for it in my convolution.
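
To make the second option concrete, here is a minimal 1D sketch in plain Python of packing bin 0 and bin n/2 into one complex element, together with the special case it forces on the pointwise product of a convolution. The helper names (`pack_half`, `unpack_half`, `multiply_packed`) are hypothetical, `dft` is again a naive stand-in for an FFT kernel, and n is assumed to be even.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for a real FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def pack_half(F):
    """Keep bins 0..n/2-1 of a real-input spectrum; fold bin n/2 into bin 0."""
    h = len(F) // 2
    packed = list(F[:h])
    packed[0] = complex(F[0].real, F[h].real)   # both bins are purely real
    return packed

def unpack_half(P):
    """Rebuild the full Hermitian spectrum from the packed n/2 entries."""
    h = len(P)
    n = 2 * h
    F = [0j] * n
    F[0] = complex(P[0].real, 0.0)
    F[h] = complex(P[0].imag, 0.0)
    for k in range(1, h):
        F[k] = P[k]
        F[n - k] = P[k].conjugate()
    return F

def multiply_packed(A, B):
    """Pointwise spectral product; element 0 is the only special case."""
    out = [a * b for a, b in zip(A, B)]
    # Bin 0 and bin n/2 are multiplied separately, not as one complex number.
    out[0] = complex(A[0].real * B[0].real, A[0].imag * B[0].imag)
    return out

f = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
g = [0.5, -1.0, 2.0, 2.0, 0.0, 1.0, 1.5, -3.0]
product = unpack_half(multiply_packed(pack_half(dft(f)), pack_half(dft(g))))
```

In this sketch the special case is confined to a single element per packed row, which is the trade-off against storing n/2+1 entries.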

Since I try to reuse buffers as much as possible (due to the limited RAM available on the GPU), using more storage isn't a nice solution. Furthermore, my algorithms are not equipped to work on matrix sizes that are not powers of 2 / multiples of 16 (this varies from kernel to kernel). I would rather avoid introducing special cases as well, since those would make my kernels more complex and hurt efficiency (I'm already having trouble minimizing the register count used by each kernel).

So my question is whether there is an elegant approach to this problem, meaning one that works without either using more memory or special-casing certain elements.

Ideally I would like to be able to do the whole FFT without splitting my combined data in the middle of the FFT, but I'm not sure that's possible.

Comments (1)

入怼 2024-10-05 03:00:40

Hmmm...my two references are:

http://www.engineeringproductivitytools.com/stuff/T0001/PT10.HTM
http://images.apple.com/acg/pdf/FFTapps_20090909.pdf

I think that committing to a "Hermitian" data structure, with the 0 and n/2 values packed into the first element, is the way to go, as the forward/inverse and Hermitian structures will work out better.

That way, you have rUnWrap(FFT(n/2, Even(x) + i*Odd(x)))= rFFT(x), and the riFFT can work on the "hermitian" array, producing the pair of arrays Even and Odd, which again gives the original structure.
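
As a rough illustration of that identity, the sketch below computes the spectrum of a real array of even length n from a single complex transform of length n/2. It is plain Python with a naive `dft` standing in for the FFT, and `rfft_via_half` is a hypothetical name for the combination of the half-length transform and the "unwrap" step; the answer's actual `rUnWrap` may differ in layout and detail.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for a real FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def rfft_via_half(x):
    """Bins 0..n/2 of the DFT of real x, using one complex DFT of length n/2."""
    n = len(x)
    h = n // 2
    # Even samples become the real part, odd samples the imaginary part.
    Z = dft([complex(x[2 * m], x[2 * m + 1]) for m in range(h)])
    X = []
    for k in range(h):
        Zc = Z[(h - k) % h].conjugate()
        E = (Z[k] + Zc) / 2       # DFT of the even samples
        O = (Z[k] - Zc) / 2j      # DFT of the odd samples
        X.append(E + cmath.exp(-2j * cmath.pi * k / n) * O)
    # Bin n/2 is E[0] - O[0], where E[0] = Re Z[0] and O[0] = Im Z[0].
    X.append(complex(Z[0].real - Z[0].imag, 0.0))
    return X

x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
X = rfft_via_half(x)
```

This version returns n/2+1 bins for clarity. In the packed layout described above, bin 0 (`Z[0].real + Z[0].imag`) and bin n/2 (`Z[0].real - Z[0].imag`) would share the first element as its real and imaginary parts.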

There are also other samplings that can be done, whereby the original array is broken into four n/2 x n/2 arrays, rooted at (0,0), (0,1), (1,0), (1,1), and then wrapped up at the end using a final radix-4 pass... perhaps that is better for the GPU memory... I don't know.
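
One way to read this suggestion is as a 2D decimation in time (often called vector radix): transform the four sub-sampled arrays separately, then combine them with twiddle factors in one final pass. The sketch below is an assumption about what is meant, written in plain Python with a naive `dft2` standing in for the 2D FFT; `dft2_vector_radix` is a hypothetical name, and the array is assumed square with even side length.

```python
import cmath

def dft2(a):
    """Naive 2D DFT of a square list-of-lists, standing in for a 2D FFT."""
    n = len(a)
    w = [cmath.exp(-2j * cmath.pi * t / n) for t in range(n)]
    return [[sum(a[r][c] * w[(r * k1 + c * k2) % n]
                 for r in range(n) for c in range(n))
             for k2 in range(n)] for k1 in range(n)]

def dft2_vector_radix(a):
    """2D DFT built from four (n/2 x n/2) DFTs plus one combining pass."""
    n = len(a)
    h = n // 2
    # Sub-arrays rooted at (0,0), (0,1), (1,0), (1,1), each sampled with stride 2.
    S = {(p, q): dft2([[a[2 * r + p][2 * c + q] for c in range(h)]
                       for r in range(h)])
         for p in (0, 1) for q in (0, 1)}
    w = [cmath.exp(-2j * cmath.pi * t / n) for t in range(n)]
    out = [[0j] * n for _ in range(n)]
    for k1 in range(n):
        for k2 in range(n):
            # Final pass: four twiddled terms per output bin.
            out[k1][k2] = sum(w[(p * k1 + q * k2) % n] * S[(p, q)][k1 % h][k2 % h]
                              for p in (0, 1) for q in (0, 1))
    return out

a = [[1.0, 2.0, 0.0, -1.0],
     [3.0, 0.5, -2.0, 4.0],
     [0.5, -1.0, 2.0, 2.0],
     [0.0, 1.0, 1.5, -3.0]]
result = dft2_vector_radix(a)
```

For real input the four sub-arrays are themselves real, so they could in principle be paired up as real and imaginary parts of complex transforms; that step is not shown here.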

alan
