What is the ideal approach for CNN modeling?
I am trying to detect a certain type of sound in audio files. The recordings have variable lengths, the sound I want to detect is usually around 1~5 seconds long, and I have labels for the dataset (the onset and offset of each event).
My initial approach was to treat it as a binary classification problem: I compute a mel spectrogram every half second (for example), label that spectrogram 0 if there wasn't an event in those 0.5 s, and 1 otherwise.
In what way could I fight this? I am trying to change it by passing, say, 0.1 instead of 1 (keeping the previous example): basically labeling each image with the fraction of it covered by the event, so labels in [0~1] instead of {0, 1}.
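Concretely, the windowing and labeling I have in mind looks roughly like this (a minimal sketch, assuming librosa for the mel spectrograms; the window length, event format, and function name are just placeholders):

```python
import numpy as np
import librosa

def window_labels(y, sr, events, win_s=0.5, fractional=False):
    """Split audio into fixed windows, compute a mel spectrogram per window,
    and label each window from the annotated (onset, offset) events."""
    win = int(win_s * sr)
    specs, labels = [], []
    for start in range(0, len(y) - win + 1, win):
        t0, t1 = start / sr, (start + win) / sr
        # total duration of this window covered by annotated events
        overlap = sum(max(0.0, min(t1, off) - max(t0, on)) for on, off in events)
        frac = min(overlap / win_s, 1.0)
        # either the fractional coverage in [0, 1] or a hard {0, 1} label
        labels.append(frac if fractional else float(frac > 0))
        specs.append(librosa.feature.melspectrogram(y=y[start:start + win], sr=sr))
    return np.stack(specs), np.array(labels)
```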
Many thanks.

I have approached problems like this by using a fixed-input-size CNN to do a simple classification, then calling the CNN multiple times as you scan across your variable-length sample (the 1-5 sec sound bite).
For example, say you create a CNN that takes 0.2 s of data as input; the input size is now fixed. You can compute a {0, 1} label for that 0.2 s window based on whether its center point falls within an event as you defined in your question. You could try different input sizes using the same method.
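A minimal sketch of that labeling rule, assuming the events are given as (onset, offset) pairs in seconds (the function name and window parameters are placeholders):

```python
def center_label(window_start_s, window_len_s, events):
    """Label a fixed-length window {0, 1} by whether its center point
    falls inside any annotated (onset, offset) event."""
    center = window_start_s + window_len_s / 2.0
    return int(any(onset <= center < offset for onset, offset in events))

# e.g. a 0.2 s window starting at 1.4 s, with one event from 1.3 s to 2.8 s -> 1
print(center_label(1.4, 0.2, [(1.3, 2.8)]))
```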
Now you ask the CNN to make a prediction at every point in your 1-5 second sample. To start, you pass the CNN the first 0.2 s of data, then step forward one or more data points (your step size is a hyperparameter you can tune). If your step size is 0.1 s, your second step produces a CNN classification using the data from 0.1 s to 0.3 s of your sample. Continue until you reach the end of the sample. You now have classifications across the whole sample. In principle you could get a classification at every data point, so you have as many predictions as you have data points. A rolling median filter (see pandas) is a great way to smooth out the predictions.
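Here's a sketch of that scanning loop, assuming a Keras-style `model.predict` and the 0.2 s window / 0.1 s step from the example above (all names and parameters are illustrative, not from the paper):

```python
import numpy as np
import pandas as pd

def scan(y, sr, model, win_s=0.2, step_s=0.1, smooth=5):
    """Slide a fixed-size window over a variable-length clip, classify each
    position with the CNN, and smooth the per-step scores with a rolling median."""
    win, step = int(win_s * sr), int(step_s * sr)
    starts = range(0, len(y) - win + 1, step)
    windows = np.stack([y[s:s + win] for s in starts])        # (n_steps, win)
    scores = model.predict(windows[..., np.newaxis]).ravel()  # one score per step
    smoothed = pd.Series(scores).rolling(smooth, center=True, min_periods=1).median()
    times = np.array([s / sr for s in starts]) + win_s / 2.0  # window-center times
    return times, smoothed.to_numpy()
```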
This is a very simple CNN to set up. You also benefit from a large increase in training data, because each sound file now yields many training samples, and the prediction resolution with this method is very fine-grained.
Here's a paper that describes the approach in greater depth (there's also a slightly earlier version on arXiv under the same title if this one is paywalled for you); start reading from Section 3 onward:
https://academic.oup.com/mnras/article/476/1/1151/4828364
In that paper we're working with 1D astronomy data, which is structured basically the same as 1D audio data, so the technique applies. I'm also doing a bit more than classification there: with the same technique I'm localizing zero or more events as well as characterizing them (for your purposes I would start with just the classification). So you can see that this approach extends quite well; in fact, even multiple events that partially overlap each other in time can be identified and extracted effectively.