Vectorizing a simple low-pass filter on ARM NEON
I have a simple single-pole low-pass filter (for parameter smoothing) that can be described by the following formula:
y[n] = (1-a) * y[n-1] + a * x[n]
How can I effectively vectorize this on ARM NEON using intrinsics? Is it even possible?
The problem is that every computation needs the previous result.
Comments (4)
Assuming that you perform vector operations M elements at a time (I think NEON is 128 bits wide, so that would be M=4 32-bit elements), you can unroll the difference equation by a factor of M pretty easily for the simple single-pole filter. Assume that you have already calculated all outputs up to y[n]. Then, you can calculate the next four as follows:

y[n+1] = (1-a) * y[n] + a * x[n+1]
y[n+2] = (1-a)^2 * y[n] + a * (1-a) * x[n+1] + a * x[n+2]
y[n+3] = (1-a)^3 * y[n] + a * (1-a)^2 * x[n+1] + a * (1-a) * x[n+2] + a * x[n+3]
y[n+4] = (1-a)^4 * y[n] + a * (1-a)^3 * x[n+1] + a * (1-a)^2 * x[n+2] + a * (1-a) * x[n+3] + a * x[n+4]

In general, you can write y[n+k] as:

y[n+k] = (1-a)^k * y[n] + a * sum_{i=1..k} (1-a)^(k-i) * x[n+i]

I know the above is difficult to read (maybe we can migrate this question over to Signal Processing, http://dsp.stackexchange.com, and I can re-typeset in LaTeX). But, given an initial condition y[n] (which is assumed to be the last output calculated on the previous vectorized iteration), you can calculate the next M outputs in parallel, as the rest of the unrolled filter has an FIR-like structure.

There are some caveats to this approach: if M becomes large, then you end up multiplying a bunch of numbers together in order to get the effective FIR coefficients for the unrolled filters. Depending upon your number format and the value of a, this could have numerical precision implications. Also, you don't get an M-fold speedup with this approach: you end up calculating y[n+k] with what amounts to a k-tap FIR filter. Although you're calculating M outputs in parallel, the fact that you have to do k multiply-accumulate operations instead of the simple first-order recursive implementation diminishes some of the benefit of vectorization.
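The unrolling by M=4 described above could be sketched in C roughly as follows. This is only an illustration under stated assumptions: the function names `lowpass_scalar` and `lowpass_block4` are made up for this example, the input length is assumed to be a multiple of 4, and the NEON path is only compiled when `__ARM_NEON` is defined (a plain C fallback with the same arithmetic is used otherwise).

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar reference: y[n] = (1-a)*y[n-1] + a*x[n], with y_prev acting as y[-1]. */
static void lowpass_scalar(const float *x, float *y, size_t n, float a, float y_prev)
{
    for (size_t i = 0; i < n; ++i) {
        y_prev = (1.0f - a) * y_prev + a * x[i];
        y[i] = y_prev;
    }
}

/* Hypothetical block version: produces 4 outputs per iteration from the
 * last output of the previous block and the 4 new inputs. */
static void lowpass_block4(const float *x, float *y, size_t n, float a, float y_prev)
{
    assert(n % 4 == 0); /* assumption: no tail handling in this sketch */

    const float b = 1.0f - a;
    /* c[k] = (1-a)^(k+1): weight of the carried-in output in output k. */
    const float c[4]  = { b, b * b, b * b * b, b * b * b * b };
    /* hj[k] = a*(1-a)^(k-j) for k >= j, else 0: weight of input j in output k. */
    const float h0[4] = { a,    a * b, a * b * b, a * b * b * b };
    const float h1[4] = { 0.0f, a,     a * b,     a * b * b };
    const float h2[4] = { 0.0f, 0.0f,  a,         a * b };
    const float h3[4] = { 0.0f, 0.0f,  0.0f,      a };

#if defined(__ARM_NEON)
    const float32x4_t vc  = vld1q_f32(c);
    const float32x4_t vh0 = vld1q_f32(h0);
    const float32x4_t vh1 = vld1q_f32(h1);
    const float32x4_t vh2 = vld1q_f32(h2);
    const float32x4_t vh3 = vld1q_f32(h3);

    for (size_t i = 0; i < n; i += 4) {
        /* acc = c*y_prev + h0*x[i] + h1*x[i+1] + h2*x[i+2] + h3*x[i+3] */
        float32x4_t acc = vmulq_n_f32(vc, y_prev);
        acc = vmlaq_n_f32(acc, vh0, x[i]);
        acc = vmlaq_n_f32(acc, vh1, x[i + 1]);
        acc = vmlaq_n_f32(acc, vh2, x[i + 2]);
        acc = vmlaq_n_f32(acc, vh3, x[i + 3]);
        vst1q_f32(y + i, acc);
        y_prev = vgetq_lane_f32(acc, 3); /* carry the last output forward */
    }
#else
    for (size_t i = 0; i < n; i += 4) {
        float out[4];
        for (int k = 0; k < 4; ++k) {
            out[k] = c[k] * y_prev + h0[k] * x[i] + h1[k] * x[i + 1]
                   + h2[k] * x[i + 2] + h3[k] * x[i + 3];
        }
        for (int k = 0; k < 4; ++k) {
            y[i + k] = out[k];
        }
        y_prev = out[3];
    }
#endif
}
```

Note that each block of four outputs costs five vector multiply(-accumulate) operations here, compared with four scalar multiply-accumulates in the recursive form, so whether this is a net win would have to be measured on the target.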
You can only really vectorize this if you have more than one signal to which you wish to apply the same filter, e.g. if it's a stereo audio signal then you can process the left and right channel in parallel. Four or eight channels in parallel would obviously be even better.
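As a rough sketch of the multi-channel idea, the following C function filters four channels at once, one channel per vector lane. The name `lowpass_4ch` and the interleaved layout (`x[4*n + ch]`) are assumptions made for this example, not something the question specifies; the NEON path is only compiled when `__ARM_NEON` is defined.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: x and y hold `frames` frames of 4 interleaved
 * channels. state[4] holds y[n-1] for each channel and is updated in place. */
static void lowpass_4ch(const float *x, float *y, size_t frames, float a, float state[4])
{
#if defined(__ARM_NEON)
    float32x4_t vy = vld1q_f32(state);
    for (size_t n = 0; n < frames; ++n) {
        float32x4_t vx = vld1q_f32(x + 4 * n);
        /* y = (1-a)*y + a*x, applied to all four channels at once */
        vy = vmulq_n_f32(vy, 1.0f - a);
        vy = vmlaq_n_f32(vy, vx, a);
        vst1q_f32(y + 4 * n, vy);
    }
    vst1q_f32(state, vy);
#else
    for (size_t n = 0; n < frames; ++n) {
        for (int ch = 0; ch < 4; ++ch) {
            state[ch] = (1.0f - a) * state[ch] + a * x[4 * n + ch];
            y[4 * n + ch] = state[ch];
        }
    }
#endif
}
```

The recursion along time is still serial here; the parallelism comes entirely from the channels being independent of each other.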
In general, you can only vectorize completely independent sets of computations. But in your IIR low pass, every output depends on the previous one (except the first), so vectorization is not possible.
If your variable "a" is large enough that (1-a)^n quickly decays to below your desired noise floor or allowed error, you could substitute a short FIR filter approximation for your IIR, and vectorize that convolution instead. But that's not likely to be faster.
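To make the FIR substitution concrete, here is a minimal C sketch. The impulse response of the filter is h[k] = a * (1-a)^k, and dropping all taps from index N onward discards a total weight of (1-a)^N, so N can be chosen as the smallest value with (1-a)^N at or below the allowed error. The function names and the `eps` parameter are hypothetical, and inputs before the start of the buffer are assumed to be zero.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest N such that (1-a)^N <= eps, i.e. the truncated tail is within eps. */
static size_t fir_taps_needed(float a, float eps)
{
    assert(a > 0.0f && a <= 1.0f && eps > 0.0f);
    size_t n = 0;
    float tail = 1.0f;
    while (tail > eps) {
        tail *= 1.0f - a;
        ++n;
    }
    return n;
}

/* Fill h[k] = a * (1-a)^k for k = 0 .. ntaps-1. */
static void fir_fill_taps(float *h, size_t ntaps, float a)
{
    float w = a;
    for (size_t k = 0; k < ntaps; ++k) {
        h[k] = w;
        w *= 1.0f - a;
    }
}

/* y[n] = sum_k h[k] * x[n-k], treating x before index 0 as zero.
 * Each output depends only on inputs, so outputs are independent of
 * each other and the loops can be vectorized. */
static void fir_apply(const float *x, float *y, size_t n, const float *h, size_t ntaps)
{
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < ntaps && k <= i; ++k) {
            acc += h[k] * x[i - k];
        }
        y[i] = acc;
    }
}
```

For a = 0.5 and eps = 1e-3 this gives 10 taps, which already means ten multiply-accumulates per output instead of two multiplies; for the small values of a typical in parameter smoothing the tap count grows quickly, which is consistent with the caution above that this is unlikely to be faster.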
How about expanding the equations to 4 steps and using matrix multiplication? a is constant, so the matrix can be precalculated.