Real-time recognition of non-speech, non-music sounds from a continuous microphone stream

Published on 2024-12-18 03:38:06

I'm looking to log events corresponding to a specific sound, such as a car door slamming, or perhaps a toaster ejecting toast.

The system needs to be more sophisticated than a "loud noise detector"; it needs to be able to distinguish that specific sound from other loud noises.

The identification need not be zero-latency, but the processor needs to keep up with a continuous stream of incoming data from a microphone that is always on.

  • Is this task significantly different than speech recognition, or could I make use of speech recognition libraries/toolkits to identify these non-speech sounds?
  • Given the requirement that I only need to match one sound (as opposed to matching among a library of sounds), are there any special optimizations I can do?

This answer indicates that a matched filter would be appropriate, but I am hazy on the details. I don't believe a simple cross-correlation on the audio waveform data between a sample of the target sound and the microphone stream would be effective, due to variations in the target sound.
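For concreteness, the naive matched-filter approach being doubted here amounts to normalized cross-correlation of the stream against a recorded template. A minimal sketch, assuming NumPy and purely synthetic signals (the 0.8 threshold is an illustrative choice, not a recommendation):

```python
import numpy as np

def matched_filter_detect(stream: np.ndarray, template: np.ndarray,
                          threshold: float) -> bool:
    """Return True if the template appears anywhere in the stream."""
    # Unit-normalize the template so the score is volume-insensitive.
    t = template - template.mean()
    t = t / (np.linalg.norm(t) + 1e-12)
    s = stream - stream.mean()
    # Slide the template across the stream (cross-correlation).
    corr = np.correlate(s, t, mode="valid")
    # Normalize each window by its own energy -> score roughly in [-1, 1].
    energy = np.sqrt(np.convolve(s * s, np.ones(len(t)), mode="valid")) + 1e-12
    return bool(np.max(corr / energy) > threshold)

# Synthetic check: a short tone buried in noise.
rng = np.random.default_rng(0)
template = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 200))
noise = rng.normal(0, 0.1, 2000)
stream = noise.copy()
stream[800:1000] += template
```

As the question anticipates, this score degrades quickly once the real-world sound deviates from the stored template in pitch, duration, or timbre.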

My question is also similar to this, which didn't get much attention.

指尖微凉心微凉 2024-12-25 03:38:06

I found an interesting paper on the subject.

It should work for your application as well, if not better than it does for vehicle sounds.

When analyzing the training data, it...

  1. Takes samples of 200ms
  2. Does a Fourier Transform (FFT) on each sample
  3. Does a Principal Component Analysis on the frequency vectors

    • Calculates the mean of all samples of this class
    • Subtracts the mean from the samples
    • Calculates the eigen-vectors of the mean covariance matrix (mean of the outer products of each vector with itself)
    • Stores the mean and the most significant eigen-vectors.

Then to classify a sound, it...

  1. Takes samples of 200ms (S).
  2. Does a Fourier Transform on each sample.
  3. Subtracts the mean of the class (C) from the frequency vector (F).
  4. Multiplies the frequency vector with each eigen-vector of C, giving a number from each.
  5. Subtracts the product of each number and the corresponding eigen-vector from F.
  6. Takes the length of the resulting vector.
  7. If this value is below some constant, S is recognized as belonging to the class C.
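The recipe above can be condensed into a short NumPy sketch. This is my own illustration of the described steps, not code from the paper — the frame length, the number of retained eigenvectors, and all names are assumptions, and real audio framing/windowing is omitted:

```python
import numpy as np

def train(frames: np.ndarray, k: int):
    """frames: (n_samples, frame_len) array of 200 ms audio frames."""
    # Step 2: magnitude spectrum of each frame.
    freq = np.abs(np.fft.rfft(frames, axis=1))
    # Steps 3a/3b: mean of the class, then center.
    mean = freq.mean(axis=0)
    centered = freq - mean
    # Step 3c: mean covariance matrix (mean of outer products), eigenvectors.
    cov = centered.T @ centered / len(centered)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 3d: keep the k most significant eigenvectors (largest eigenvalues).
    basis = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return mean, basis

def residual(frame: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> float:
    f = np.abs(np.fft.rfft(frame)) - mean  # steps 2-3: spectrum minus class mean
    coeffs = basis.T @ f                   # step 4: project onto eigenvectors
    f = f - basis @ coeffs                 # step 5: remove the projection
    return float(np.linalg.norm(f))        # step 6: length of what remains

def classify(frame, mean, basis, threshold) -> bool:
    return residual(frame, mean, basis) < threshold  # step 7

# Toy demonstration: a "class" of slightly varying 10-cycle sinusoid frames.
rng = np.random.default_rng(1)
t = np.arange(400) / 400
frames = np.stack([np.sin(2 * np.pi * (10 + rng.normal(0, 0.2)) * t)
                   for _ in range(50)])
mean, basis = train(frames, k=3)
```

Projecting onto the top eigenvectors and thresholding the residual is essentially a one-class PCA (subspace) detector, which fits the "match only one sound" requirement well.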
栖竹 2024-12-25 03:38:06

This doctoral thesis, Non-Speech Environmental Sound Classification System for Autonomous Surveillance, by Cowling (2004), has experimental results on different techniques for audio feature extraction, as well as classification. He uses environmental sounds such as jangling keys and footsteps, and achieves an accuracy of 70%:

The best technique is found to be either Continuous Wavelet Transform feature extraction with Dynamic Time Warping, or Mel-Frequency Cepstral Coefficients with Dynamic Time Warping. Both of these techniques achieve a 70% recognition rate.
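The Dynamic Time Warping half of that result is simple enough to sketch. Below is a textbook DTW distance in NumPy between two frame-level feature sequences; the CWT/MFCC feature extraction itself is not shown (any `(n_frames, n_features)` matrix would slot in), and none of this is code from the thesis:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW distance between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    # cost[i, j] = best cumulative cost of aligning a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],     # skip a frame of a
                                 cost[i, j - 1],     # skip a frame of b
                                 cost[i - 1, j - 1]) # match frames
    return float(cost[n, m])
```

Because DTW aligns sequences of different lengths, one stored template can still match a slower or faster occurrence of the same sound; classification is then just a threshold on the distance to the template.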

If you limit yourself to one sound, perhaps you might be able to achieve a higher recognition rate?

The author also mentions that techniques that work fairly well with speech recognition (learning vector quantization and neural networks) don't work so well with environmental sounds.

I have also found a more recent article here: Detecting Audio Events for Semantic Video Search, by Bugalho et al. (2009), where they detect sound events in movies (like gun shots, explosions, etc).

I have no experience in this area. I have merely stumbled upon this material as a result of your question piquing my interest. I'm posting my finds here in the hope that it helps with your research.
