Handling variable-length audio in machine learning
I'm working on a model for speech emotion recognition. I'm currently in the pre-processing phase, writing a utility that transforms audio files into a feature space of fixed dimensions. I plan to experiment with spectrograms, mel-spectrograms and MFCCs (including deltas and delta-deltas) as input features for convolutional neural networks. One glaring issue is that the audio clips are of variable length.
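To make the input layout concrete, here is a minimal sketch of stacking a feature matrix with its deltas and delta-deltas as three CNN channels. It assumes the MFCCs are already computed as an array of shape `(n_coeffs, n_frames)`; the function name `stack_with_deltas` is hypothetical, and `np.gradient` is used as a simple finite-difference stand-in for the smoothed delta estimators that speech toolkits usually provide.

```python
import numpy as np

def stack_with_deltas(features):
    """Turn (n_coeffs, n_frames) features into a (3, n_coeffs, n_frames) array.

    Channel 0: the features, channel 1: deltas, channel 2: delta-deltas.
    Assumption: deltas are plain finite differences along the time axis.
    """
    features = np.asarray(features, dtype=np.float32)
    delta = np.gradient(features, axis=1)   # first difference over time
    delta2 = np.gradient(delta, axis=1)     # second difference over time
    return np.stack([features, delta, delta2], axis=0)

# Hypothetical input: 13 coefficients over 200 frames of random data.
mfcc = np.random.randn(13, 200).astype(np.float32)
cnn_input = stack_with_deltas(mfcc)
```

The time axis (`n_frames`) is the one that varies from clip to clip, which is why a fixed length is needed before batching.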
I know the typical way of dealing with this is to pick some fixed length and then either pad every audio file up to that length or truncate it. I imagine padding is preferable, because truncation loses data that could be valuable in training. So I intend to pad the audio files with zeros to a fixed length: I found the maximum duration of any audio file in my dataset and rounded it up to the next whole number to get the target length. After resampling everything to a fixed sampling rate, I intend to append trailing zeros so that all the audio files have the same length.
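A minimal sketch of that padding step, assuming the clips are already resampled to a common rate and loaded as 1-D NumPy arrays. The helper name `pad_to_fixed_length`, the 16 kHz rate and the random clips are all hypothetical placeholders.

```python
import math
import numpy as np

def pad_to_fixed_length(waveform, sample_rate, target_seconds):
    """Right-pad a 1-D waveform with zeros up to target_seconds of audio."""
    waveform = np.asarray(waveform, dtype=np.float32)
    target_len = int(round(target_seconds * sample_rate))
    if waveform.shape[0] > target_len:
        raise ValueError("clip is longer than the target length")
    # Pad only at the end (trailing zeros), nothing at the start.
    return np.pad(waveform, (0, target_len - waveform.shape[0]), mode="constant")

# Hypothetical clips at an assumed common sampling rate of 16 kHz.
sample_rate = 16000
clips = [np.random.randn(n).astype(np.float32) for n in (12000, 40000, 51234)]

# Longest clip in the dataset, rounded up to a whole number of seconds.
max_seconds = max(len(c) for c in clips) / sample_rate
target_seconds = math.ceil(max_seconds)

# Every row now has the same number of samples and can be batched.
batch = np.stack([pad_to_fixed_length(c, sample_rate, target_seconds) for c in clips])
```

Padding the raw waveform before feature extraction means the trailing frames of each spectrogram correspond to digital silence, which is the region the question below is concerned about.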
My question is: could these extra zero-padded dimensions confuse the model? I know neural networks learn their own feature extraction, but are there any caveats I should be aware of, or is there an alternative approach that may produce better results?
Thanks.