Vectorizing a simple low-pass filter on ARM NEON

Posted on 2024-12-25 03:48:00

I have a simple single-pole low-pass filter (for parameter smoothing) described by the following formula:

y[n] = (1-a) * y[n-1] + a * x[n]

How can I vectorize this effectively on ARM NEON using intrinsics? Is it even possible?
The problem is that every computation needs the previous result.
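
For reference, the recurrence in the question could be written as the scalar loop below. This is only a baseline sketch: the function name `lowpass_scalar` and the convention of passing in and returning the filter state are assumptions, not something given in the question.

```c
#include <stddef.h>

/* Scalar baseline for y[n] = (1-a)*y[n-1] + a*x[n].
 * y_prev is the last output before this buffer (the filter state).
 * Returns the last output so the caller can carry it into the next buffer.
 * Name and signature are illustrative assumptions. */
static float lowpass_scalar(const float *x, float *y, size_t n,
                            float a, float y_prev)
{
    for (size_t i = 0; i < n; ++i) {
        /* Each output depends on the one just computed: a serial chain. */
        y_prev = (1.0f - a) * y_prev + a * x[i];
        y[i] = y_prev;
    }
    return y_prev;
}
```

The loop-carried dependency on `y_prev` is what prevents a compiler from vectorizing this loop directly.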


Answer by 把时间冻结, 2025-01-01 03:48:00

Assuming that you perform vector operations M elements at a time (NEON's quad registers are 128 bits wide, so M = 4 for 32-bit elements), you can unroll the difference equation by a factor of M fairly easily for a simple single-pole filter. Assume that you have already calculated all outputs up to y[n]. Then you can calculate the next four as follows:

y[n+1] = (1-a)*y[n] + a*x[n+1]
y[n+2] = (1-a)*y[n+1] + a*x[n+2] = (1-a)*((1-a)*y[n] + a*x[n+1]) + a*x[n+2]
       = (1-a)^2*y[n] + a*(1-a)*x[n+1] + a*x[n+2]
...

In general, you can write y[n+k] as:

y[n+k] = (1-a)^k*y[n] + sum_{i=1}^k a*(1-a)^{k-i}*x[n+i]
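
As a worked instance of the unrolling, here are all four outputs for M = 4. The shorthand b = 1 - a is introduced here only for readability and is not part of the original answer.

```latex
\[
\begin{aligned}
y[n+1] &= b\,y[n]   + a\,x[n+1] \\
y[n+2] &= b^{2}y[n] + ab\,x[n+1]     + a\,x[n+2] \\
y[n+3] &= b^{3}y[n] + ab^{2}x[n+1]   + ab\,x[n+2]   + a\,x[n+3] \\
y[n+4] &= b^{4}y[n] + ab^{3}x[n+1]   + ab^{2}x[n+2] + ab\,x[n+3] + a\,x[n+4]
\end{aligned}
\qquad b = 1 - a
\]
```

Every right-hand side depends only on y[n] and the inputs, so the four rows no longer depend on each other.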

I know the plain-text notation above is difficult to read (maybe this question could be migrated to Signal Processing, http://dsp.stackexchange.com, where it could be typeset in LaTeX). The key point is this: given an initial condition y[n] (assumed to be the last output calculated on the previous vectorized iteration), you can calculate the next M outputs in parallel, because the rest of the unrolled filter has an FIR-like structure.
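
A minimal sketch of this idea for M = 4 with float32 data might look like the following. It is one possible arrangement, not the answer author's code: the function name `lowpass_block4`, the state-passing convention, and the use of `vextq_f32` against a zero vector to build the shifted inputs are all assumptions made for illustration. A plain C fallback with the same arithmetic is used when NEON is not available at compile time.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define LP_HAVE_NEON 1
#else
#define LP_HAVE_NEON 0
#endif

/* Block-of-4 form of y[n] = (1-a)*y[n-1] + a*x[n] (illustrative sketch).
 * y_prev is the last output before this buffer; the last output is returned. */
static float lowpass_block4(const float *x, float *y, size_t n,
                            float a, float y_prev)
{
    const float b = 1.0f - a;
    /* Effective FIR taps a*b^j and the powers of b applied to y_prev. */
    const float c0 = a, c1 = a * b, c2 = a * b * b, c3 = a * b * b * b;
    const float p[4] = { b, b * b, b * b * b, b * b * b * b };
    size_t i = 0;
#if LP_HAVE_NEON
    const float32x4_t vp = vld1q_f32(p);
    const float32x4_t zero = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        const float32x4_t vx = vld1q_f32(x + i);          /* [x0 x1 x2 x3] */
        float32x4_t acc = vmulq_n_f32(vp, y_prev);        /* b^k * y_prev  */
        acc = vmlaq_n_f32(acc, vx, c0);                   /* + a * x[k]    */
        /* vextq_f32(zero, vx, 3) = [0 x0 x1 x2], and so on. */
        acc = vmlaq_n_f32(acc, vextq_f32(zero, vx, 3), c1);
        acc = vmlaq_n_f32(acc, vextq_f32(zero, vx, 2), c2);
        acc = vmlaq_n_f32(acc, vextq_f32(zero, vx, 1), c3);
        vst1q_f32(y + i, acc);
        y_prev = vgetq_lane_f32(acc, 3);                  /* carry state   */
    }
#else
    /* Portable fallback: same unrolled arithmetic, one output per line. */
    for (; i + 4 <= n; i += 4) {
        const float x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        y[i]     = p[0] * y_prev + c0 * x0;
        y[i + 1] = p[1] * y_prev + c1 * x0 + c0 * x1;
        y[i + 2] = p[2] * y_prev + c2 * x0 + c1 * x1 + c0 * x2;
        y[i + 3] = p[3] * y_prev + c3 * x0 + c2 * x1 + c1 * x2 + c0 * x3;
        y_prev = y[i + 3];
    }
#endif
    /* Remaining 0..3 samples use the plain recurrence. */
    for (; i < n; ++i) {
        y_prev = b * y_prev + a * x[i];
        y[i] = y_prev;
    }
    return y_prev;
}
```

The only serial dependency left is the single `y_prev` value carried from one block of four to the next. Results can differ from the plain recurrence in the last few bits because the operations are grouped differently.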

There are some caveats to this approach. If M becomes large, you end up multiplying many numbers together to get the effective FIR coefficients of the unrolled filter; depending on your number format and the value of a, this can cost numerical precision. Also, you don't get an M-fold speedup: y[n+k] is computed with what amounts to a k-tap FIR filter. Although you calculate M outputs in parallel, having to do k multiply-accumulate operations for the k-th output, instead of the single step of the first-order recursion, diminishes some of the benefit of vectorization.
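
To put a rough number on that overhead, one can count scalar multiplies per block of M outputs. This is a back-of-the-envelope estimate, under the assumption that the coefficients are precomputed and that output k costs k tap multiplies plus one multiply for the y[n] term, while the recursion costs two multiplies per output.

```latex
\[
\underbrace{\sum_{k=1}^{M} (k+1)}_{\text{unrolled}} = \frac{M(M+3)}{2}
\qquad\text{vs.}\qquad
\underbrace{2M}_{\text{recursive}},
\qquad
M = 4:\quad 14 \ \text{vs.}\ 8 .
\]
```

The unrolled form does more arithmetic in total, and the gap grows quadratically with M; any gain comes from doing that arithmetic M lanes at a time.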
