Birdsong audio analysis - finding how well two clips match
I have ~100 wav audio files at a sample rate of 48,000 Hz of birds of the same species, and I'd like to measure the similarity between them. I'm starting with wave files, but I know (very slightly) more about working with images, so I assume my analysis will be on the spectrogram images. I have several samples of some birds from different days.
Here are some examples of the data [spectrogram images omitted] (apologies for the unlabeled axes; x is the sample index, y is linear frequency scaled by something like 10,000 Hz).
These birdsongs apparently occur in "words", distinct segments of song, which is probably the level at which I ought to be comparing: both the differences between similar words and the frequency and order of the various words.
I want to try to take out the cicada noise - cicadas chirp at a pretty consistent frequency and tend to phase-match, so this shouldn't be too hard.
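Since cicadas sit at a fairly fixed frequency, a notch filter seems like a plausible first attack. A minimal sketch with scipy, where the 4 kHz centre frequency and the file name are made-up placeholders (measure the actual cicada peak from your spectrogram first):

```python
import numpy as np
from scipy import signal
from scipy.io import wavfile

fs, x = wavfile.read("bird_sample.wav")  # hypothetical file name
x = x.astype(np.float64)
if x.ndim > 1:
    x = x[:, 0]  # keep one channel if the file happens to be stereo

f_cicada = 4000.0  # Hz; placeholder for the measured cicada frequency
b, a = signal.iirnotch(w0=f_cicada, Q=30.0, fs=fs)
x_clean = signal.filtfilt(b, a, x)  # zero-phase, so no added phase distortion
```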
It seems like some thresholding might be useful.
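Continuing from the snippet above, a rough thresholding sketch: estimate a per-bin noise floor from the spectrogram (the median is a crude but common choice) and zero out everything below a multiple of it; the 3x factor is an arbitrary starting point, not a tuned value:

```python
import numpy as np
from scipy import signal

f, t, Sxx = signal.spectrogram(x_clean, fs=fs, nperseg=1024, noverlap=512)
noise_floor = np.median(Sxx, axis=1, keepdims=True)  # per-frequency-bin estimate
Sxx_thresh = np.where(Sxx > 3.0 * noise_floor, Sxx, 0.0)
```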
I'm told that most of the existing literature uses manual classification based on song characteristics, like the Pandora Music Genome Project. I want to be like the Echo Nest and use automatic classification. Update: a lot of people do study this.
My question is what tools should I use for this analysis? I need to:
- Filter/threshold out general noise and keep the music
- Filter out specific noises, like those of cicadas
- Split and classify phrases, syllables, and/or notes in birdsongs
- Create measures of difference/similarity between parts; something which will pick up differences between birds while minimizing differences between different calls of the same bird (one candidate measure is sketched below)
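On the last point, one candidate measure (my assumption, not an established recipe) is dynamic time warping over per-frame feature sequences, since phrases stretch and compress in time. A self-contained numpy/scipy sketch, where `seq_a` and `seq_b` are hypothetical (n_frames, n_features) arrays such as spectrogram columns or MFCC frames:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dtw_distance(seq_a, seq_b):
    """Classic O(n*m) dynamic time warping on Euclidean frame distances."""
    D = cdist(seq_a, seq_b)  # pairwise distances between frames
    acc = np.full((len(seq_a) + 1, len(seq_b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(seq_a) + 1):
        for j in range(1, len(seq_b) + 1):
            acc[i, j] = D[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                              acc[i, j - 1],      # deletion
                                              acc[i - 1, j - 1])  # match
    return acc[-1, -1] / (len(seq_a) + len(seq_b))  # length-normalised cost
```

Lower cost means more similar; comparing a word against other renditions of the same word versus against different words should show whether the measure separates them.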
My weapon of choice is numpy/scipy, but might something like OpenCV be useful here?
Edit: updated my terminology and reworded my approach after some research and Steve's helpful answer.
Had to make this an answer as it's simply too long for a comment.
I'm basically working in this field right now, so I feel I have some knowledge. Obviously, from my standpoint I'd recommend working with audio rather than images. I also recommend using MFCCs for your feature extraction (you can think of them as coefficients which summarise/characterise specific sub-bands of audio frequency [because they are]).
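A minimal extraction sketch, assuming librosa purely for illustration (any MFCC implementation, e.g. python_speech_features, would do):

```python
import librosa

y, sr = librosa.load("bird_sample.wav", sr=None)     # sr=None keeps the native 48 kHz
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
frames = mfccs.T                                     # one 13-dim vector per frame
```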
GMMs are the way to go.
To perform this task you must have some (preferably a lot of) labelled/known data, otherwise there is no basis for the machine learning to take place.
A technicality which you may find useful:
More accurately, you submit a query to each GMM (which, if you're using them correctly, gives you a likelihood score [probability] of that particular feature vector being emitted by that probability distribution). Then you compare the likelihood scores you receive from all the GMMs and classify based on the highest score.
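A sketch of that query-and-compare loop using scikit-learn's GaussianMixture (the library is my assumption, and `train_frames`, a dict mapping each class label to its (n_frames, n_features) MFCC array, is hypothetical):

```python
from sklearn.mixture import GaussianMixture

# Train one GMM per class on that class's feature frames
gmms = {label: GaussianMixture(n_components=16).fit(X)
        for label, X in train_frames.items()}

def classify(query_frames):
    # .score() returns the average per-frame log-likelihood under each model;
    # pick the class whose GMM explains the query best
    return max(gmms, key=lambda label: gmms[label].score(query_frames))
```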
UBMs
Rather than "filtering out" noise, you can simply model all background noise/channel distortion with a UBM (Universal Background Model). This model consists of a GMM trained using all the training data available to you (that is, all the training data you used for each class). You can use this to get a 'likelihood ratio' (Pr[x would be emitted by specific model] / Pr[x would be emitted by background model (UBM)]) to help remove any biasing that can be explained by the background model itself.
Interesting question, but quite broad. I do recommend looking at some existing literature on automatic bird song identification. (Yup, there are a bunch of people working on it.)
This paper (edit: sorry, dead link, but this chapter by Dufour et al. 2014 might be even clearer) uses a basic two-stage pattern recognition method that I would recommend trying first: feature extraction (the paper uses MFCCs), then classification (the paper uses a GMM). For each frame in the input signal, you get a vector of MFCCs (between 10 and 30). These MFCC vectors are used to train a GMM (or SVM) along with the corresponding bird species labels. Then, during testing, you submit a query MFCC vector to the GMM, and it will tell you which species it thinks it is.
Although some have applied image processing techniques to audio classification/fingerprinting problems (e.g., this paper by Google Research), I hesitate to recommend these techniques for your problem or ones like it because of the annoying temporal variations.
"What tools should I use for this analysis?" Among many others:
Sorry for the incomplete answer, but it's a broad question, and there is more to this problem than can be answered here briefly.
You are apparently already performing an STFT or something similar to construct those images, so I suggest building useful summaries of these mixed time/frequency structures. I remember a system built for a slightly different purpose which was able to make good use of audio waveform data by breaking it into a small number (< 30) of bins by time and amplitude and simply counting the number of samples that fell in each bin. You might be able to do something similar, either in the time/amplitude domain or the time/frequency domain.
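A sketch of that binning idea in the time/amplitude domain; the bin counts are arbitrary placeholders:

```python
import numpy as np

def bin_summary(x, n_time_bins=10, n_amp_bins=8):
    """Split the waveform into coarse time slices, histogram the amplitudes
    in each slice, and flatten the counts into one fixed-length vector."""
    counts = []
    for chunk in np.array_split(x, n_time_bins):
        hist, _ = np.histogram(chunk, bins=n_amp_bins,
                               range=(x.min(), x.max()))
        counts.append(hist)
    return np.concatenate(counts)  # length: n_time_bins * n_amp_bins
```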
Depending on how you want to define your application, you may need either a supervised or an unsupervised approach. In the first case you will need some annotation process in order to provide the training phase with a set of mappings from samples (audio files) to classes (bird IDs or whatever your class is). With an unsupervised approach, you need to cluster your data so that similar sounds are mapped to the same cluster.
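For the unsupervised route, a minimal clustering sketch with scikit-learn, assuming pre-computed per-clip MFCC matrices (`clip_features`, a hypothetical list of (n_frames, n_mfcc) arrays, one per clip):

```python
import numpy as np
from sklearn.cluster import KMeans

# Summarise each clip by its mean MFCC vector, then cluster the summaries
clip_vectors = np.vstack([f.mean(axis=0) for f in clip_features])
labels = KMeans(n_clusters=5).fit_predict(clip_vectors)  # 5 is a placeholder
```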
You could try my library, pyAudioAnalysis, which provides high-level wrappers for both sound classification and sound clustering.