Improving the speed of an FFT implementation

Posted 2024-12-22 09:51:27


I'm a beginner in programming and am currently working on a project that requires a Fast Fourier Transform implementation.

I have so far managed to implement the following:

Does anyone have any alternatives or suggestions to improve the speed of the program without losing accuracy?

short FFTMethod::FFTcalc(short int dir,long m,double *x,double *y)
{
long n,i,i1,j,k,i2,l,l1,l2;
double c1,c2,tx,ty,t1,t2,u1,u2,z;

/* Calculate the number of points */
n = 1;
for (i=0;i<m;i++) 
    n *= 2;

/* Do the bit reversal */
i2 = n >> 1;
j = 0;
for (i=0;i<n-1;i++) {
  if (i < j) {
     tx = x[i];
     ty = y[i];
     x[i] = x[j];
     y[i] = y[j];
     x[j] = tx;
     y[j] = ty;
  }
  k = i2;
  while (k <= j) {
     j -= k;
     k >>= 1;
  }
  j += k;
}

/* Compute the FFT */
c1 = -1.0; 
c2 = 0.0;
l2 = 1;
for (l=0;l<m;l++) {
   l1 = l2;
   l2 <<= 1;
   u1 = 1.0; 
   u2 = 0.0;
   for (j=0;j<l1;j++) {
     for (i=j;i<n;i+=l2) {
        i1 = i + l1;
        t1 = u1 * x[i1] - u2 * y[i1];
        t2 = u1 * y[i1] + u2 * x[i1];
        x[i1] = x[i] - t1; 
        y[i1] = y[i] - t2;
        x[i] += t1;
        y[i] += t2;
     }
     z =  u1 * c1 - u2 * c2;
     u2 = u1 * c2 + u2 * c1;
     u1 = z;
   }
   c2 = sqrt((1.0 - c1) / 2.0);
   if (dir == 1)
      c2 = -c2;
   /* note: c1 is updated unconditionally (half-angle recurrence) */
   c1 = sqrt((1.0 + c1) / 2.0);
  }

/* Scaling for forward transform */
if (dir == 1) {
   for (i=0;i<n;i++) {
      x[i] /= n;
      y[i] /= n;
   }
 } 


   return(1);
}

Comments (4)

烂人 2024-12-29 09:51:27


I recently found this excellent PDF on the construction of high-performance FFTs by Eric Postpischil. Having developed several FFTs myself, I know how hard it is to compete with commercial libraries. Believe me, you're doing well if your FFT is only 4x slower than Intel or FFTW, not 40x! You can, however, compete, and here's how.

To summarise that article, the author states that radix-2 FFTs are simple but inefficient; the most efficient construct is the radix-4 FFT. An even more efficient method is radix-8, but this often does not fit into the registers on a CPU, so radix-4 is preferred.

FFTs can be constructed in stages, so to compute a 1024-point FFT you could perform 10 stages of a radix-2 FFT (as 2^10 = 1024), or 5 stages of a radix-4 FFT (4^5 = 1024). You could even compute a 1024-point FFT in stages of 8*4*4*4*2 if you so choose. Fewer stages mean fewer reads and writes to memory (the bottleneck for FFT performance is memory bandwidth), hence dynamically choosing radix 4, 8 or higher is a must. The radix-4 stage is particularly efficient, as all weights are 1+0i, 0+1i, -1+0i, 0-1i, and the radix-4 butterfly code can be written to fit entirely in the cache.
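To make the "no multiplies" point concrete, here is a minimal sketch of a forward radix-4 butterfly (my own illustration, not code from the article): since the four weights are 1, -i, -1 and i, the combination needs only additions, subtractions and real/imaginary swaps.

```cpp
#include <array>
#include <complex>

// Forward 4-point butterfly: multiplying by -i is just a swap-and-negate
// of the real and imaginary parts, so no general complex multiply occurs.
std::array<std::complex<double>, 4>
radix4_butterfly(const std::array<std::complex<double>, 4>& a)
{
    const std::complex<double> s0 = a[0] + a[2];
    const std::complex<double> d0 = a[0] - a[2];
    const std::complex<double> s1 = a[1] + a[3];
    // d1 = -i * (a[1] - a[3]) done with component swaps only
    const std::complex<double> d1(a[1].imag() - a[3].imag(),
                                  a[3].real() - a[1].real());
    return { s0 + s1, d0 + d1, s0 - s1, d0 - d1 };
}
```

Feeding it a shifted impulse reproduces the expected 4-point DFT {1, -i, -1, i}, which is an easy sanity check.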

Secondly, the stages of an FFT are not all the same. In the first stage the weights are all equal to 1+0i; there is no point computing this weight, or even multiplying by it, since it is a complex multiply by 1, so the first stage may be performed without weights. The final stage may also be treated differently and can be used to perform the decimation in time (bit reversal). Eric Postpischil's document covers all this.

The weights may be precomputed and stored in a table. Sin/cos calculations take around 100-150 cycles each on x86 hardware, so precomputing these can save 10-20% of the overall compute time, as memory access is in this case faster than CPU calculation. Using a fast algorithm to compute sine and cosine in one go is particularly beneficial (note that cos equals sqrt(1.0 - sine*sine) up to sign, or, using table lookups, cos is just a phase shift of sine).
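A sketch of such a precomputed weight table (names are my own): for an n-point FFT every stage draws its weights from the n/2 angles 2*pi*k/n, k = 0..n/2-1, so one table built up front serves every repeated transform of that size.

```cpp
#include <cmath>
#include <vector>

// Twiddle-factor table: trades one-time trig calls (~100-150 cycles each
// on x86, per the answer above) for a table lookup in each butterfly.
struct TwiddleTable {
    std::vector<double> cosv, sinv;
    explicit TwiddleTable(long n) : cosv(n / 2), sinv(n / 2) {
        const double pi = std::acos(-1.0);
        for (long k = 0; k < n / 2; ++k) {
            const double angle = 2.0 * pi * double(k) / double(n);
            cosv[k] = std::cos(angle);
            sinv[k] = std::sin(angle);
        }
    }
};
```

The question's inner loop would then index this table instead of running the u1/u2 recurrence, which also avoids the slow accumulation of rounding error in that recurrence.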

Finally, once you have your super-streamlined FFT implementation, you can use SIMD vectorization to compute 4 single-precision or 2 double-precision operations per cycle inside the butterfly routine, for another 100-300% speed improvement. Taking all of the above together, you'd have yourself a pretty slick and fast FFT!

To go further, you can perform optimisation on the fly by providing different implementations of the FFT stages targeted at specific processor architectures. Cache size, register count, SSE/SSE2/3/4 instruction sets and so on differ per machine, so a one-size-fits-all approach is often beaten by targeted routines. In FFTW, for instance, many smaller-size FFTs are highly optimised, unrolled (loop-free) implementations targeted at a specific architecture. By combining these smaller constructs (such as radix-N routines) you can choose the fastest and best routine for the task at hand.
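The codelet-dispatch idea can be sketched like this (a toy illustration of the concept, not FFTW's actual API): each small size maps to a hand-unrolled routine, and the generic staged FFT is used only when no codelet fits.

```cpp
#include <complex>
#include <cstddef>

using Codelet = void (*)(std::complex<double>*);

void fft1(std::complex<double>*) { /* size 1: nothing to do */ }

void fft2(std::complex<double>* a) {  // fully unrolled 2-point FFT
    const std::complex<double> t = a[0];
    a[0] = t + a[1];
    a[1] = t - a[1];
}

// Pick a specialised routine for small sizes; nullptr means the caller
// should fall back to the generic staged implementation.
Codelet pick_codelet(std::size_t n) {
    switch (n) {
        case 1:  return fft1;
        case 2:  return fft2;
        default: return nullptr;
    }
}
```

A real library would carry many such codelets (sizes 2-64 and beyond), each tuned per architecture, and compose them into larger transforms.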

远昼 2024-12-29 09:51:27


While I can't give you a performance hint right now, I'd like to give some advice for your optimization that is too long for a comment:

  1. If you haven't done so, write a number of correctness tests for your code right now. Simple tests like "do an FFT of this array and see if the results match the ones I've provided" suffice, but before you optimize code, you need a firm and automated unit test that confirms your optimized code is correct.
  2. Then profile your code to see where the actual bottleneck is. While I suspect the innermost loop (for (i=j;i<n;i+=l2)) is the culprit, seeing is better than believing.
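A sketch of the correctness harness suggested in point 1: a naive O(n^2) DFT is trivial to trust, so an optimized FFT can be checked against it element by element before and after every optimization pass.

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Reference transform: direct evaluation of the DFT definition.
std::vector<std::complex<double>>
naive_dft(const std::vector<std::complex<double>>& in)
{
    const double pi = std::acos(-1.0);
    const std::size_t n = in.size();
    std::vector<std::complex<double>> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t t = 0; t < n; ++t)
            out[k] += in[t] *
                      std::polar(1.0, -2.0 * pi * double(k * t) / double(n));
    return out;
}

// True when the two spectra agree to within tol in every bin.
bool spectra_match(const std::vector<std::complex<double>>& a,
                   const std::vector<std::complex<double>>& b,
                   double tol = 1e-9)
{
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::abs(a[i] - b[i]) > tol) return false;
    return true;
}
```

Known analytic cases (an impulse transforms to a flat spectrum, a constant to a single spike) make good first test vectors before random-input comparisons.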
穿越时光隧道 2024-12-29 09:51:27


There are several things I can recommend trying:

  1. Don't swap the input elements, instead calculate the bit-reversed index. This will save you a number of memory reads and writes.
  2. Precalculate the coefficients if you're doing many FFTs of the same size. This will save some computations.
  3. Use radix-4 FFT instead of radix-2. This will result in fewer iterations in the inner loops.

The ultimate answer can, of course, be found by profiling the code.
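Suggestion 1 can be sketched as follows: compute the bit-reversed index on the fly instead of pre-swapping the arrays (`bits` is log2 of the FFT size, the `m` parameter in the question's code).

```cpp
// Reverse the low `bits` bits of i, e.g. 6 (110b) over 3 bits -> 3 (011b).
unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);  // peel the low bit of i, push it onto r
        i >>= 1;
    }
    return r;
}
```

The butterfly loops would then read x[bit_reverse(i, m)] in the first stage rather than relying on a pre-shuffled array, saving the up-front pass of swaps.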

等待我真够勒 2024-12-29 09:51:27


This looks like a basic radix-2 FFT implementation straight out of an old textbook. There are many dozens of decades-old papers on optimizing FFTs in various ways, depending on many factors. For instance, is your data smaller than the CPU cache?

Added: For instance, if the data vector plus a table of coefficients fits into the CPU dcache, and/or if multiplies are much slower than memory accesses on your CPU, then precomputing a table of twiddle factors may reduce the total cycle count for repeated uses of the FFT. But if not, precomputing might actually be slower. Benchmark. YMMV.
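A minimal harness for the benchmarking recommended here (my own sketch): run the candidate several times and keep the best wall-clock time, which damps scheduler noise. `work` stands in for whichever FFT variant (with or without a precomputed table) is being measured.

```cpp
#include <algorithm>
#include <chrono>

// Returns the best of `repeats` wall-clock timings of work(), in seconds.
template <typename F>
double best_time_seconds(F&& work, int repeats = 10)
{
    using clock = std::chrono::steady_clock;
    double best = 1e300;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        best = std::min(best,
                        std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

Comparing the table-based and recomputed-trig variants with the same harness, on the same data size, answers the "benchmark, YMMV" question directly for your machine.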
