Improving the speed of an FFT implementation
I am a beginner programmer, currently working on a project that requires implementing a fast Fourier transform.
So far, I have managed to implement the following:
Does anyone have any alternatives or suggestions to improve the speed of the program without losing accuracy?
/* Requires <math.h> for sqrt(). */
short FFTMethod::FFTcalc(short int dir, long m, double *x, double *y)
{
    long n, i, i1, j, k, i2, l, l1, l2;
    double c1, c2, tx, ty, t1, t2, u1, u2, z;

    /* Calculate the number of points: n = 2^m */
    n = 1;
    for (i = 0; i < m; i++)
        n *= 2;

    /* Do the bit reversal */
    i2 = n >> 1;
    j = 0;
    for (i = 0; i < n - 1; i++) {
        if (i < j) {
            tx = x[i];
            ty = y[i];
            x[i] = x[j];
            y[i] = y[j];
            x[j] = tx;
            y[j] = ty;
        }
        k = i2;
        while (k <= j) {
            j -= k;
            k >>= 1;
        }
        j += k;
    }

    /* Compute the FFT */
    c1 = -1.0;
    c2 = 0.0;
    l2 = 1;
    for (l = 0; l < m; l++) {
        l1 = l2;
        l2 <<= 1;
        u1 = 1.0;
        u2 = 0.0;
        for (j = 0; j < l1; j++) {
            for (i = j; i < n; i += l2) {
                i1 = i + l1;
                t1 = u1 * x[i1] - u2 * y[i1];
                t2 = u1 * y[i1] + u2 * x[i1];
                x[i1] = x[i] - t1;
                y[i1] = y[i] - t2;
                x[i] += t1;
                y[i] += t2;
            }
            z  = u1 * c1 - u2 * c2;
            u2 = u1 * c2 + u2 * c1;
            u1 = z;
        }
        c2 = sqrt((1.0 - c1) / 2.0);
        if (dir == 1)
            c2 = -c2;
        c1 = sqrt((1.0 + c1) / 2.0);
    }

    /* Scaling for forward transform */
    if (dir == 1) {
        for (i = 0; i < n; i++) {
            x[i] /= n;
            y[i] /= n;
        }
    }

    return 1;
}
Comments (4)
I recently found this excellent PDF on the construction of high performance FFTs by Eric Postpischil. Having developed several FFTs myself, I know how hard it is to compete with commercial libraries. Believe me, you're doing well if your FFT is only 4x slower than Intel or FFTW, not 40x! You can, however, compete, and here's how.
To summarise that article: the author states that radix-2 FFTs are simple but inefficient, and that the most efficient construct is the radix-4 FFT. An even more efficient method is radix-8, but this often does not fit into the registers on a CPU, so radix-4 is preferred.
FFTs can be constructed in stages, so to compute a 1024-point FFT you could perform 10 stages of the radix-2 FFT (as 2^10 = 1024), or 5 stages of the radix-4 FFT (4^5 = 1024). You could even compute a 1024-point FFT in stages of 8*4*4*4*2 if you so choose. Fewer stages mean fewer reads and writes to memory (the bottleneck for FFT performance is memory bandwidth), hence dynamically choosing radix 4, 8 or higher is a must. The radix-4 stage is particularly efficient, as all weights are 1+0i, 0+1i, -1+0i, 0-1i, and the radix-4 butterfly code can be written to fit entirely in the cache.
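To make that concrete, here is a minimal sketch (my own illustration, not code from the article) of a single radix-4 decimation-in-time butterfly with the trivial twiddles 1, -i, -1, i of a forward transform. Because multiplying by -i is just a swap and a sign flip, the whole butterfly needs no multiplications at all:

```cpp
#include <complex>

using cd = std::complex<double>;

// One radix-4 DIT butterfly with trivial twiddles (forward transform).
// Multiplication by -i is done by swapping real/imaginary parts and
// flipping a sign, so the butterfly is adds and subtracts only.
static void radix4_butterfly(cd &a, cd &b, cd &c, cd &d)
{
    cd t0 = a + c;                 // even-index sum
    cd t1 = a - c;                 // even-index difference
    cd t2 = b + d;                 // odd-index sum
    cd t3 = b - d;                 // odd-index difference
    cd t3i(t3.imag(), -t3.real()); // t3 * -i, without a complex multiply
    a = t0 + t2;
    b = t1 + t3i;
    c = t0 - t2;
    d = t1 - t3i;
}
```

Feeding it the sequence 1, 2, 3, 4 produces the 4-point DFT 10, -2+2i, -2, -2-2i.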
Secondly, the stages of the FFT are not all the same. In the first stage the weights are all equal to 1+0i. There is no point computing this weight, or even multiplying by it, as it is a complex multiply by 1, so the first stage may be performed without weights. The final stage may also be treated differently and can be used to perform the decimation in time (bit reversal). Eric Postpischil's document covers all this.
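As a sketch of that first point (my own illustration; the function name is made up): with every weight equal to 1+0i, the first radix-2 stage collapses to plain add/subtract pairs, which is exactly what the code in the question computes for l = 0, except that here no u1/u2 bookkeeping is needed:

```cpp
// First radix-2 stage with all weights equal to 1+0i: just butterfly
// each adjacent pair of complex values (re[i], im[i]) and (re[i+1], im[i+1]).
static void first_stage_no_weights(double *re, double *im, long n)
{
    for (long i = 0; i + 1 < n; i += 2) {
        double tr = re[i + 1], ti = im[i + 1];
        re[i + 1] = re[i] - tr;   // pair difference
        im[i + 1] = im[i] - ti;
        re[i] += tr;              // pair sum
        im[i] += ti;
    }
}
```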
The weights may be precomputed and stored in a table. Sin/cos calculations take around 100-150 cycles each on x86 hardware, so precomputing these can save 10-20% of the overall compute time, as memory access is in this case faster than CPU calculation. Using fast algorithms to compute sin and cos in one go is particularly beneficial. (Note that cos is equal to sqrt(1.0 - sine*sine), or, using table lookups, cos is just a phase shift of sine.)
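A minimal sketch of such a table, assuming the usual forward-transform twiddles W_n^k = exp(-2*pi*i*k/n) (the function name is mine):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Precompute the n/2 distinct twiddle factors exp(-2*pi*i*k/n) once,
// so the butterfly loops do table lookups instead of sin/cos calls.
static std::vector<std::complex<double>> make_twiddles(std::size_t n)
{
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> w(n / 2);
    for (std::size_t k = 0; k < n / 2; ++k) {
        double angle = -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        w[k] = std::complex<double>(std::cos(angle), std::sin(angle));
    }
    return w;
}
```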
Finally, once you have your super-streamlined FFT implementation, you can utilise SIMD vectorization to compute 4x single-precision or 2x double-precision floating point operations per cycle inside the butterfly routine, for another 100-300% speed improvement. Taking all of the above together, you'd have yourself a pretty slick and fast FFT!
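For a taste of what that looks like (a sketch assuming an x86 target with SSE2; the function name is mine), the add/subtract half of a radix-2 butterfly can process the real and imaginary parts of a complex double together in one 128-bit register:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (baseline on x86-64)

// a and b each point to a complex value stored as {real, imag}.
// One vector add and one vector subtract replace four scalar operations.
static void butterfly_add_sub(double *a, double *b)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(a, _mm_add_pd(va, vb));  // a = a + b
    _mm_storeu_pd(b, _mm_sub_pd(va, vb));  // b = old_a - b
}
```

The twiddle multiply can be vectorized in a similar spirit (shuffles handle the cross terms), and AVX doubles the lane count again.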
To go further, you can perform optimisation on the fly by providing different implementations of the FFT stages targeted at specific processor architectures. Cache size, register count, SSE/SSE2/3/4 instruction sets etc. differ per machine, so a one-size-fits-all approach is often beaten by targeted routines. In FFTW, for instance, many smaller-size FFTs are highly optimised, unrolled (no loops) implementations targeted at a specific architecture. By combining these smaller constructs (such as radix-N routines), you can choose the fastest and best routine for the task at hand.
While I can't give you a performance hint right now, I'd like to give some advice for your optimization that is too long for a comment. There are several things I can recommend trying, starting with the memory access pattern of the inner loop:

for (i=j;i<n;i+=l2)

The ultimate answer can, of course, be found by profiling the code; seeing is better than believing.
This looks like a basic radix-2 FFT implementation straight out of an old textbook. There are many dozens of decades-old papers on optimizing FFTs in various ways, depending on many factors. For instance, is your data smaller than the CPU cache?
Added: For instance, if the data vector plus a table of coefficients will fit into the CPU dcache, and/or if multiplies are much slower than memory accesses on your CPU, then precomputing a table of twiddle factors may reduce the total cycle count for repeated uses of the FFT. But if not, precomputing might actually be slower. Benchmark. YMMV.
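A minimal timing harness for such a benchmark might look like this (a sketch; `fft_under_test` is a placeholder for whichever variant, table-based or recomputing, you want to measure):

```cpp
#include <chrono>

// Run the candidate routine `reps` times and return the average
// wall-clock time per call in milliseconds.
template <typename F>
static double average_ms(F &&fft_under_test, int reps)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        fft_under_test();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / reps;
}
```

Run both variants on the same data sizes you actually use; the winner can change with the transform length and the cache footprint.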