Vectorizing a simple low-pass filter on ARM NEON

Posted 2024-12-25 03:48:00

I have a simple single pole low pass filter (for parameter smoothing) that can be explained by the following formula:

y[n] = (1-a) * y[n-1] + a * x[n]

How can I vectorize this efficiently on ARM NEON using intrinsics? Is it even possible?
The problem is that every computation needs the previous result.

Comments (4)

把时间冻结 2025-01-01 03:48:00

Assuming that you perform vector operations on M elements at a time (NEON registers are 128 bits wide, so M = 4 for 32-bit elements), you can unroll the difference equation by a factor of M fairly easily for the simple single-pole filter. Assume that you have already calculated all outputs up to y[n]. Then you can calculate the next four as follows:

y[n+1] = (1-a)*y[n] + a*x[n+1]
y[n+2] = (1-a)*y[n+1] + a*x[n+2] = (1-a)*((1-a)*y[n] + a*x[n+1]) + a*x[n+2]
       = (1-a)^2*y[n] + a*(1-a)*x[n+1] + a*x[n+2]
...

In general, you can write y[n+k] as:

y[n+k] = (1-a)^k*y[n] + sum_{i=1}^k a*(1-a)^{k-i}*x[n+i]

I know the above is difficult to read (maybe we can migrate this question over to Signal Processing and I can re-typeset it in LaTeX). But given an initial condition y[n] (assumed to be the last output calculated on the previous vectorized iteration), you can calculate the next M outputs in parallel, since the rest of the unrolled filter has an FIR-like structure.

There are some caveats to this approach. If M becomes large, you end up multiplying many numbers together to get the effective FIR coefficients of the unrolled filter. Depending on your number format and the value of a, this could affect numerical precision. Also, you don't get an M-fold speedup: you end up calculating y[n+k] with what amounts to a k-tap FIR filter. Although you're calculating M outputs in parallel, each output costs up to k multiply-accumulate operations (plus the term for y[n]) instead of the single step of the first-order recursion, which cancels out some of the benefit of vectorization.
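For M = 4, the unrolled equations can be sketched in C roughly as follows. This is an illustration under stated assumptions, not a tuned implementation: the function names onepole_scalar and onepole_block4 are made up for this example, the length is assumed to be a multiple of 4, and the NEON path is only compiled when the compiler defines __ARM_NEON (a plain C loop with the same arithmetic is used otherwise). The arrays p and c0..c3 hold the powers of (1-a) from the general formula, precomputed once.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define ONEPOLE_USE_NEON 1
#endif

/* Scalar reference: y[n] = (1-a)*y[n-1] + a*x[n]. Returns the last output. */
float onepole_scalar(const float *x, float *y, size_t n, float a, float y_prev)
{
    const float b = 1.0f - a;
    for (size_t i = 0; i < n; ++i) {
        y_prev = b * y_prev + a * x[i];
        y[i] = y_prev;
    }
    return y_prev;
}

/* Block version: computes 4 outputs per iteration from y_prev and 4 inputs.
   Assumption for this sketch: n is a multiple of 4. Returns the last output. */
float onepole_block4(const float *x, float *y, size_t n, float a, float y_prev)
{
    assert(n % 4 == 0);
    const float b = 1.0f - a, b2 = b * b, b3 = b2 * b, b4 = b2 * b2;
    /* p[k-1] = (1-a)^k multiplies y[n] in output y[n+k];
       cj[k-1] is the weight of input x[n+j+1] in output y[n+k]. */
    const float p[4]  = { b,    b2,    b3,     b4     };
    const float c0[4] = { a,    a * b, a * b2, a * b3 };
    const float c1[4] = { 0.0f, a,     a * b,  a * b2 };
    const float c2[4] = { 0.0f, 0.0f,  a,      a * b  };
    const float c3[4] = { 0.0f, 0.0f,  0.0f,   a      };
#ifdef ONEPOLE_USE_NEON
    const float32x4_t vp = vld1q_f32(p);
    const float32x4_t v0 = vld1q_f32(c0), v1 = vld1q_f32(c1);
    const float32x4_t v2 = vld1q_f32(c2), v3 = vld1q_f32(c3);
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t acc = vmulq_n_f32(vp, y_prev);  /* (1-a)^k * y[n]     */
        acc = vmlaq_n_f32(acc, v0, x[i]);           /* + column * x[n+1]  */
        acc = vmlaq_n_f32(acc, v1, x[i + 1]);
        acc = vmlaq_n_f32(acc, v2, x[i + 2]);
        acc = vmlaq_n_f32(acc, v3, x[i + 3]);
        vst1q_f32(y + i, acc);
        y_prev = vgetq_lane_f32(acc, 3);            /* seed for next block */
    }
#else
    for (size_t i = 0; i < n; i += 4) {
        /* Read the inputs first so the loop also works in place (x == y). */
        const float x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        for (int k = 0; k < 4; ++k)
            y[i + k] = p[k] * y_prev + c0[k] * x0 + c1[k] * x1
                     + c2[k] * x2 + c3[k] * x3;
        y_prev = y[i + 3];
    }
#endif
    return y_prev;
}
```

The block version should agree with the scalar recursion to within floating-point rounding. Each block still has to wait for the last lane of the previous one (y_prev), so the serial dependency is shortened rather than removed.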

一向肩并 2025-01-01 03:48:00

You can only really vectorize this if you have more than one signal to which you wish to apply the same filter, e.g. if it's a stereo audio signal then you can process the left and right channel in parallel. Four or eight channels in parallel would obviously be even better.
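A minimal sketch of that idea, assuming four channels stored interleaved (frame n occupies x[4*n] .. x[4*n+3]) and one filter state per channel. The name onepole_4ch and the state array are hypothetical, chosen for this example only. The update is written as s += a*(in - s), which is algebraically the same as (1-a)*s + a*in; the NEON path is only compiled when the compiler defines __ARM_NEON, with a plain C loop as the fallback.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define ONEPOLE4_USE_NEON 1
#endif

/* Filters 4 interleaved channels with the same coefficient a.
   state[c] holds y[n-1] for channel c on entry and the last output on exit. */
void onepole_4ch(const float *x, float *y, size_t frames, float a, float state[4])
{
    assert(a >= 0.0f && a <= 1.0f);  /* smoothing coefficient range */
#ifdef ONEPOLE4_USE_NEON
    float32x4_t s = vld1q_f32(state);
    for (size_t n = 0; n < frames; ++n) {
        const float32x4_t in = vld1q_f32(x + 4 * n);
        s = vmlaq_n_f32(s, vsubq_f32(in, s), a);  /* s += a * (in - s) */
        vst1q_f32(y + 4 * n, s);
    }
    vst1q_f32(state, s);
#else
    for (size_t n = 0; n < frames; ++n) {
        for (int c = 0; c < 4; ++c) {
            state[c] += a * (x[4 * n + c] - state[c]);
            y[4 * n + c] = state[c];
        }
    }
#endif
}
```

Here each lane carries its own recursion, so there is no cross-lane dependency and no extra arithmetic compared with the scalar filter.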

各空 2025-01-01 03:48:00

In general, you can only vectorize completely independent sets of computations. But in your IIR low-pass filter, every output (except the first) depends on the previous one, so the recursion as written cannot be vectorized.

If your variable "a" is large enough that (1-a)^n quickly decays to below your desired noise floor or allowed error, you could substitute a short FIR filter approximation for your IIR, and vectorize that convolution instead. But that's not likely to be faster.
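To get a feel for how short such an FIR approximation would be, here is a small sketch. It relies on the impulse response of the filter being h[k] = a*(1-a)^k, so dropping every tap from index N onward removes a total weight of (1-a)^N. The function names onepole_fir_taps and fir_apply, the eps parameter, and the zero-history convention at the start of the signal are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Fills h[] with the taps a*(1-a)^k and stops as soon as the remaining
   tail weight (1-a)^k is at or below eps, or max_taps is reached.
   Returns the number of taps written. */
size_t onepole_fir_taps(float a, float eps, float *h, size_t max_taps)
{
    assert(a > 0.0f && a <= 1.0f && eps > 0.0f);
    const float b = 1.0f - a;
    float pw = 1.0f;  /* (1-a)^k */
    size_t k = 0;
    while (k < max_taps && pw > eps) {
        h[k++] = a * pw;
        pw *= b;
    }
    return k;
}

/* Direct-form FIR: y[i] = sum_k h[k] * x[i-k], with x treated as zero
   before the start of the buffer. */
void fir_apply(const float *x, float *y, size_t n, const float *h, size_t taps)
{
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps && k <= i; ++k)
            acc += h[k] * x[i - k];
        y[i] = acc;
    }
}
```

For a = 0.5 and eps = 0.01 this gives 7 taps. For a typical smoothing coefficient such as a = 0.01 and eps = 0.001, the same rule gives on the order of 700 taps, which illustrates why the FIR substitute is unlikely to be faster unless a is large.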

桃酥萝莉 2025-01-01 03:48:00

How about expanding the equations to 4 steps and using matrix multiplication? a is constant, so the matrix can be precalculated.
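Written out for four steps, with the shorthand b = 1 - a (introduced here only for readability), the expansion of the recursion from the question takes the following form. The vector of powers and the lower-triangular matrix depend only on a, so both can be computed once up front:

```latex
\begin{bmatrix} y[n+1] \\ y[n+2] \\ y[n+3] \\ y[n+4] \end{bmatrix}
=
y[n]
\begin{bmatrix} b \\ b^{2} \\ b^{3} \\ b^{4} \end{bmatrix}
+
a
\begin{bmatrix}
1     & 0     & 0 & 0 \\
b     & 1     & 0 & 0 \\
b^{2} & b     & 1 & 0 \\
b^{3} & b^{2} & b & 1
\end{bmatrix}
\begin{bmatrix} x[n+1] \\ x[n+2] \\ x[n+3] \\ x[n+4] \end{bmatrix},
\qquad b = 1 - a
```

Each column of the matrix can be held in one 128-bit register and scaled by a single input sample, so one block of four outputs costs one vector multiply and four vector multiply-accumulates.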
