Simplest and fastest method for audio activity detection?
I have an array of 320 elements (int16) representing 20 ms of audio (16-bit LPCM). I am looking for the simplest, very fast method to decide whether this array contains active audio (such as speech or music) rather than noise or silence. The decision does not need to be very accurate, but it must be very fast.
My first idea was to sum the squares or absolute values of the elements and compare the sum with a threshold, but this is very slow on my system, even though it is O(n).
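For reference, the baseline described above could look roughly like the following C sketch. The function name `frame_is_active` and the threshold `ACTIVE_MEAN_POWER` are assumptions made for illustration, not values from the question; the threshold in particular has to be tuned for the actual signal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical threshold: mean power above which a frame counts as active.
   500 * 500 corresponds to an RMS amplitude of 500 (full scale is 32767). */
#define ACTIVE_MEAN_POWER (500LL * 500LL)

/* Returns 1 if the mean squared sample value of the frame exceeds the
   threshold, 0 otherwise. */
int frame_is_active(const int16_t *frame, size_t n)
{
    assert(frame != NULL);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t s = frame[i];
        sum += (int64_t)s * s;   /* each square is at most 2^30 */
    }
    /* Compare the sum against threshold * n so no division is needed. */
    return sum > ACTIVE_MEAN_POWER * (int64_t)n;
}
```

For a 20 ms frame this is 320 multiply-accumulate steps and a single comparison.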
Comments (4)
You're not going to get much faster than a sum-of-squares approach.
One optimization that you may not be doing yet is to use a running total. That is, in each time step, instead of summing the squares of the last n samples, keep a running total and update it with the square of the most recent sample. To keep the running total from growing without bound, add an exponential decay. In pseudocode:
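A minimal C sketch of such a running total, written as a fixed-point leaky integrator. The names `running_energy`, `DECAY_SHIFT` and `ACTIVE_MEAN_POWER` are assumptions for illustration only, and the decay is implemented with a shift so that no multiplication by a fractional constant is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning constants. */
#define DECAY_SHIFT 6                        /* decay factor 1 - 1/64 per sample */
#define ACTIVE_MEAN_POWER (500ULL * 500ULL)  /* smoothed power above this = active */

typedef struct {
    uint64_t total;   /* decayed running total of squared samples */
} running_energy;

/* total = total * (1 - 1/64) + x * x */
void running_energy_push(running_energy *st, int16_t x)
{
    int32_t s = x;
    st->total -= st->total >> DECAY_SHIFT;   /* exponential decay */
    st->total += (uint64_t)(s * s);          /* square is at most 2^30 */
}

/* For a steady signal the total settles at 64 times the mean power,
   so the threshold is scaled by the same factor. */
int running_energy_is_active(const running_energy *st)
{
    return st->total > (ACTIVE_MEAN_POWER << DECAY_SHIFT);
}

/* Push a whole frame and report the state after its last sample. */
int process_frame(running_energy *st, const int16_t *frame, size_t n)
{
    assert(st != NULL && frame != NULL);
    for (size_t i = 0; i < n; i++)
        running_energy_push(st, frame[i]);
    return running_energy_is_active(st);
}
```

Start from a zero-initialized `running_energy`. With a shift of 6 the total forgets old samples within a few hundred samples, roughly one 20 ms frame in the question's setup.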
Of course, you'll have to tune the decay constant and threshold to suit your application. If this isn't fast enough to run in real time, you have a seriously underpowered DSP...
You might try calculating two simple "statistics". The first is spread (max - min); silence will have a very low spread. The second is variety: divide the range of possible values into, say, 16 brackets (value ranges) and, as you go through the elements, determine which bracket each element falls into. Noise will have similar counts in all brackets, whereas music or speech should favor some of them while neglecting others.
This can be done in a single pass through the array, and it needs no complicated arithmetic, just some additions and comparisons.
Also consider some approximation, for example taking only every fourth value, which reduces the number of checked elements to 80. For an audio signal, this should be okay.
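The two statistics and the subsampling can be combined in one loop. The following C sketch is only an illustration: `frame_stats`, `frame_statistics` and the `step` parameter are hypothetical names, and the sketch assumes 16 equal brackets of 4096 values each. How to turn the bracket counts into a decision is left open, since the thresholds depend on the signal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BRACKETS 16

typedef struct {
    int spread;                 /* max - min over the inspected samples */
    int counts[NUM_BRACKETS];   /* inspected samples per bracket */
} frame_stats;

/* One pass over the frame, looking only at every step-th sample. */
frame_stats frame_statistics(const int16_t *frame, size_t n, size_t step)
{
    assert(n > 0 && step > 0);
    frame_stats st = { 0, {0} };
    int lo = frame[0];
    int hi = frame[0];
    for (size_t i = 0; i < n; i += step) {
        int s = frame[i];
        if (s < lo) lo = s;
        if (s > hi) hi = s;
        /* Shift to 0..65535, then keep the top 4 bits: bracket 0..15. */
        st.counts[(s + 32768) >> 12]++;
    }
    st.spread = hi - lo;
    return st;
}
```

Calling `frame_statistics(frame, 320, 4)` inspects 80 samples; a frame whose `spread` stays below some small, tuned value can be treated as silence without looking at the brackets at all.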
I did something like this a while back. After some experimentation I arrived at a solution that worked sufficiently well in my case.
I used the rate of change of the cube of the running average over about 120 ms. When there is silence (that is, only noise), the expression should hover around zero. As soon as the rate starts increasing over a couple of runs, you probably have some action going on.
I used a cube because the square just wasn't aggressive enough. If the cube is too slow for you, try using the square and a bitshift instead. Hope this helps.
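The answer gives no code, so the following C sketch is one possible reading of it rather than the author's implementation. It assumes 20 ms frames (so six frames cover 120 ms), uses the mean absolute amplitude of each frame as the quantity being averaged, and treats `RISE_THRESHOLD` and `RISES_NEEDED` as hypothetical tuning constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HISTORY 6                  /* 6 frames x 20 ms = 120 ms (assumed) */
#define RISE_THRESHOLD 1000000LL   /* hypothetical minimum growth per frame */
#define RISES_NEEDED 2             /* hypothetical number of rising frames in a row */

typedef struct {
    int32_t levels[HISTORY];   /* mean absolute amplitude of recent frames */
    int32_t level_sum;         /* sum of levels[] */
    size_t next;               /* slot to overwrite next */
    int64_t prev_cube;         /* cube of the previous running average */
    int rises;                 /* consecutive frames with a rising cube */
} onset_detector;

/* Feed one frame. Returns 1 once the cube of the running average has
   risen by more than RISE_THRESHOLD for RISES_NEEDED frames in a row. */
int onset_detector_push(onset_detector *d, const int16_t *frame, size_t n)
{
    assert(n > 0);
    int64_t abs_sum = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t s = frame[i];
        abs_sum += (s < 0) ? -s : s;
    }
    int32_t level = (int32_t)(abs_sum / (int64_t)n);

    /* Update the running average over the last HISTORY frames. */
    d->level_sum += level - d->levels[d->next];
    d->levels[d->next] = level;
    d->next = (d->next + 1) % HISTORY;

    int64_t avg = d->level_sum / HISTORY;
    int64_t cube = avg * avg * avg;    /* at most 2^45, fits in 64 bits */
    int64_t rate = cube - d->prev_cube;
    d->prev_cube = cube;

    d->rises = (rate > RISE_THRESHOLD) ? d->rises + 1 : 0;
    return d->rises >= RISES_NEEDED;
}
```

Start from a zero-initialized `onset_detector`. Because it looks at the rate of change, this sketch reports the onset of activity and falls back to 0 once the level is steady, so a caller would still need its own rule for when the activity ends.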
Clearly, the complexity must be at least O(n). Some simple algorithm that computes a value range is probably good enough for the moment, but I would search the web for Voice Activity Detection and for related code samples.