I'm working on a model for speech emotion recognition, and I'm currently in the pre-processing phase, building a utility that transforms audio files into a feature space of fixed dimensions. I plan to experiment with spectrograms, mel-spectrograms and MFCCs (including deltas and delta-deltas) as input features for convolutional neural networks. One glaring issue is that the audio clips are of variable length.
I know the typical way of dealing with this is to pick a fixed length and then either pad shorter files up to that length or truncate longer ones. I imagine padding is preferable, because truncation throws away data that could be valuable in training. So I intend to pad the audio files with 0s to a fixed length. I found the maximum duration of an audio file in my dataset and took its ceiling to get that length. After resampling every file to a fixed sampling rate, I intend to append trailing 0s so that all the audio files have the same length.
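To make the padding step concrete, here is a minimal sketch of what I have in mind. It assumes each clip is already loaded as a 1-D NumPy array and resampled to a common rate, and that "ceiling" means rounding the longest duration up to a whole second. The function names, the 16 kHz rate and the 4.3 s maximum duration are made up for illustration and are not taken from my dataset.

```python
import math

import numpy as np


def target_num_samples(max_duration_s: float, sample_rate: int) -> int:
    """Fixed length in samples: longest duration rounded up to whole seconds (assumption)."""
    return math.ceil(max_duration_s) * sample_rate


def pad_to_length(signal: np.ndarray, num_samples: int) -> np.ndarray:
    """Append trailing zeros so a mono signal has exactly num_samples samples."""
    if signal.ndim != 1:
        raise ValueError("expected a 1-D (mono) signal")
    if len(signal) > num_samples:
        raise ValueError("signal is longer than the target length")
    # Pad only on the right; the original samples are left untouched.
    return np.pad(signal, (0, num_samples - len(signal)),
                  mode="constant", constant_values=0)


# Illustrative values only.
sample_rate = 16_000
fixed_len = target_num_samples(4.3, sample_rate)        # 5 s * 16000 samples/s
clip = np.ones(24_000, dtype=np.float32)                # stand-in for a 1.5 s clip
padded = pad_to_length(clip, fixed_len)
print(padded.shape)
```

The spectrogram, mel-spectrogram or MFCC features would then be computed from `padded`, so every example ends up with the same number of frames.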
My question is: could this padded region confuse the model? I know neural networks handle feature extraction automatically, but are there any caveats I should be aware of, or is there an alternative approach that may produce better results?
