Negative SDR results when evaluating audio source separation

Posted on 2025-01-21 02:36:32

I'm trying to use the eval_mus_track function of the museval package to evaluate my audio source separation model. The model was trained to predict vocals, and its output looks similar to the actual vocals, yet evaluation metrics such as SDR come out negative.

Below is my function for generating the metrics:

import numpy as np
import museval


def estimate_and_evaluate(track):
    # track.audio is stereo, so we predict each channel separately
    vocals_predicted_channel_1, acompaniment_predicted_channel_1, _ = model_5.predict(np.squeeze(track.audio[:, 0]))
    vocals_predicted_channel_2, acompaniment_predicted_channel_2, _ = model_5.predict(np.squeeze(track.audio[:, 1]))

    # rebuild stereo (n_samples, 2) arrays from the two mono predictions
    vocals = np.squeeze(np.array([vocals_predicted_channel_1.wav_file, vocals_predicted_channel_2.wav_file])).T
    accompaniment = np.squeeze(np.array([acompaniment_predicted_channel_1.wav_file, acompaniment_predicted_channel_2.wav_file])).T
    estimates = {
        'vocals': vocals,
        'accompaniment': accompaniment
    }

    scores = museval.eval_mus_track(track, estimates)
    print(scores)

The metric values I get are:

vocals          ==> SDR:  -3.776  SIR:   4.621  ISR:  -0.005  SAR: -30.538  
accompaniment   ==> SDR:  -0.590  SIR:   1.704  ISR:  -0.006  SAR: -16.613 

These results don't make sense to me. First, the accompaniment prediction is pure noise, since the model was only trained for vocals, yet it gets the higher SDR. Second, the predicted vocals have a waveform very similar to the actual ones but still get a negative SDR value!
In the following graphs, the top one is the actual source and the bottom one is the predicted source:

Channel 1:
[figure: actual vocals, channel 1]
[figure: predicted vocals, channel 1]

Channel 2:
[figure: actual vocals, channel 2]
[figure: predicted vocals, channel 2]

I tried to shift the predicted vocals as mentioned here, but the result got worse.

Any idea what's causing this issue?

This is the link to the actual vocals stereo numpy array,
and this one to the predicted stereo vocals numpy array. You can load and manipulate them using np.load.
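
For example (the file names below are just placeholders for the downloaded arrays):

import numpy as np

# placeholder file names -- replace with the paths of the downloaded arrays
actual = np.load("actual_vocals.npy")        # expected shape: (n_samples, 2)
predicted = np.load("predicted_vocals.npy")  # expected shape: (n_samples, 2)

print(actual.shape, predicted.shape)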
Thanks for your time

Comments (1)

妄想挽回 2025-01-28 02:36:32

The signal-to-distortion ratio (SDR) is actually the logarithm of a ratio; see equation (12) of this article:
https://hal.inria.fr/inria-00630985/PDF/vincent_SigPro11.pdf

So an SDR of 0 means the signal energy equals the distortion energy, and an SDR below 0 means there is more distortion than signal. If the audio doesn't sound like it contains more distortion than signal, the cause is often a sample-alignment problem.
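
As a rough illustration, here is a simplified version of that ratio (it ignores the distortion-filter fitting that museval's BSS Eval implementation performs, so it is only a sketch of the idea, and simple_sdr is just a hypothetical helper):

import numpy as np

def simple_sdr(reference, estimate):
    # Simplified SDR: 10*log10(||s_target||^2 / ||error||^2),
    # treating the whole reference as the target signal.
    reference = np.asarray(reference, dtype=float).ravel()
    estimate = np.asarray(estimate, dtype=float).ravel()
    error = estimate - reference
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(error ** 2) + 1e-12))

# The value drops below 0 dB as soon as the error energy exceeds the
# energy of the reference signal itself.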

When you look at equation (12), you can see that the calculation depends strongly on preserving the exact sample alignment between the predicted and ground-truth audio. It can be difficult to tell from waveform plots, or even from listening, whether the samples are misaligned. However, a zoomed-in plot where you can see each individual sample can help you confirm that the ground-truth and predicted samples are exactly lined up. Even a shift of a single sample will keep the SDR calculation from reflecting the actual SDR.
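
For example, one way to check for a constant offset is to cross-correlate one channel of the prediction against the ground truth. This is only a sketch: it assumes SciPy is installed, the file names are placeholders for the arrays linked in the question, and estimate_lag is a hypothetical helper:

import numpy as np
from scipy import signal

def estimate_lag(reference, estimate):
    # Find the shift (in samples) that maximizes the cross-correlation;
    # 0 means the two signals are already aligned.
    corr = signal.correlate(estimate, reference, mode="full")
    lags = signal.correlation_lags(len(estimate), len(reference), mode="full")
    return lags[np.argmax(corr)]

actual = np.load("actual_vocals.npy")       # placeholder file names
predicted = np.load("predicted_vocals.npy")

print("estimated offset (samples):", estimate_lag(actual[:, 0], predicted[:, 0]))

# A non-zero offset means the prediction should be shifted (e.g. by slicing
# both arrays to their overlapping region) before calling eval_mus_track.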
