Handling variable-length audio in machine learning

Posted on 2025-02-04 03:01:39

I'm working on a model for speech emotion recognition, and I'm currently in the pre-processing phase, creating a utility that can transform audio files into a feature space of fixed dimensions. I'm planning to experiment with spectrograms, mel-spectrograms and MFCCs (including deltas and delta-deltas) as input features for convolutional neural networks. One glaring issue is that the audio is of variable length.

Now, I know that the typical way of dealing with this is to pick some fixed length and then either pad all the audio files up to that length or truncate them. I imagine padding is preferable, because truncation loses some data that could be valuable in training. So I intend to pad the audio files with 0s to expand them to a fixed length. I found the maximum duration of an audio file in my dataset and rounded it up (took its ceiling) to get the target length. I now intend to add trailing 0s (after resampling to a fixed sampling rate) so that all the audio files have the same length.
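To make the plan above concrete, here is a minimal sketch of the padding step in plain Python. It assumes each clip has already been resampled to a common rate and loaded as a list of float samples; the function names, the toy sampling rate and the sample values are hypothetical and chosen only so the numbers are easy to check by hand.

```python
import math


def target_num_samples(durations_sec, sample_rate):
    """Round the longest duration up to a whole second and convert it to samples."""
    return math.ceil(max(durations_sec)) * sample_rate


def pad_trailing_zeros(samples, target_len):
    """Return a copy of `samples` extended with trailing zeros up to `target_len`."""
    if len(samples) > target_len:
        raise ValueError("clip is longer than the target length")
    return list(samples) + [0.0] * (target_len - len(samples))


# Toy data: a tiny, hypothetical sampling rate keeps the arithmetic readable.
SAMPLE_RATE = 4
clips = [
    [0.1, -0.2, 0.3],                       # 3 samples = 0.75 s
    [0.5, 0.4, -0.1, 0.0, 0.2, 0.3, -0.3],  # 7 samples = 1.75 s
]

durations = [len(clip) / SAMPLE_RATE for clip in clips]
target_len = target_num_samples(durations, SAMPLE_RATE)  # ceil(1.75) * 4
padded = [pad_trailing_zeros(clip, target_len) for clip in clips]

# Keep the original lengths so padded positions can be told apart
# from real samples later on, if that turns out to be useful.
true_lengths = [len(clip) for clip in clips]
```

Storing `true_lengths` next to the padded clips costs little and leaves the option open to ignore the padded region later, for example when pooling over time.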

My question is: could these extra zero-padded dimensions confuse the model? I know neural networks handle feature extraction automatically, but are there any caveats I should be aware of, or is there an alternative approach that might produce better results?

Thanks.
