Negative SDR results when evaluating audio source separation

Posted on 2025-01-21 02:36:32

I'm trying to use the eval_mus_track function of the museval package to evaluate my audio source separation model. The model was trained to predict vocals, and its output looks similar to the actual vocals, yet evaluation metrics such as SDR come out negative.

Below is my function for generating the metrics:

import numpy as np
import museval


def estimate_and_evaluate(track):
    # track.audio is stereo, so we predict each channel separately
    vocals_predicted_channel_1, acompaniment_predicted_channel_1, _ = model_5.predict(np.squeeze(track.audio[:, 0]))
    vocals_predicted_channel_2, acompaniment_predicted_channel_2, _ = model_5.predict(np.squeeze(track.audio[:, 1]))

    # rebuild stereo (n_samples, 2) arrays from the two mono predictions
    vocals = np.squeeze(np.array([vocals_predicted_channel_1.wav_file, vocals_predicted_channel_2.wav_file])).T
    accompaniment = np.squeeze(np.array([acompaniment_predicted_channel_1.wav_file, acompaniment_predicted_channel_2.wav_file])).T
    estimates = {
        'vocals': vocals,
        'accompaniment': accompaniment
    }

    scores = museval.eval_mus_track(track, estimates)
    print(scores)

The metric values I get are:

vocals          ==> SDR:  -3.776  SIR:   4.621  ISR:  -0.005  SAR: -30.538  
accompaniment   ==> SDR:  -0.590  SIR:   1.704  ISR:  -0.006  SAR: -16.613 

These results don't make sense to me. First, the accompaniment prediction is pure noise, since the model was only trained for vocals, yet it gets the higher SDR. Second, the predicted vocals have a waveform very similar to the actual ones but still get a negative SDR value!
In the following graphs, the top one is the actual source and the bottom one is the predicted source:

Channel 1:
[figure: actual vocals, channel 1]
[figure: predicted vocals, channel 1]

Channel 2:
[figure: actual vocals, channel 2]
[figure: predicted vocals, channel 2]

I tried to shift the predicted vocals as mentioned here, but the result got worse.

Any idea what's causing this issue?

This is the link to the actual vocals stereo numpy array,
and this one to the predicted stereo vocals numpy array. You can load and manipulate them using np.load.
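
For example (the file names below are just placeholders for the downloaded arrays):

import numpy as np

# placeholder file names -- replace with the paths of the downloaded arrays
actual = np.load("actual_vocals.npy")        # expected shape: (n_samples, 2)
predicted = np.load("predicted_vocals.npy")  # expected shape: (n_samples, 2)

print(actual.shape, predicted.shape)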
Thanks for your time

Comments (1)

妄想挽回 2025-01-28 02:36:32

The signal-to-distortion ratio (SDR) is actually the logarithm of a ratio; see equation (12) of this article:
https://hal.inria.fr/inria-00630985/PDF/vincent_SigPro11.pdf

So an SDR of 0 means the signal energy equals the distortion energy, and an SDR below 0 means there is more distortion than signal. If the audio doesn't sound like it contains more distortion than signal, the cause is often a sample-alignment problem.
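
As a rough illustration, here is a simplified version of that ratio (it ignores the distortion-filter fitting that museval's BSS Eval implementation performs, so it is only a sketch of the idea, and simple_sdr is just a hypothetical helper):

import numpy as np

def simple_sdr(reference, estimate):
    # Simplified SDR: 10*log10(||s_target||^2 / ||error||^2),
    # treating the whole reference as the target signal.
    reference = np.asarray(reference, dtype=float).ravel()
    estimate = np.asarray(estimate, dtype=float).ravel()
    error = estimate - reference
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(error ** 2) + 1e-12))

# The value drops below 0 dB as soon as the error energy exceeds the
# energy of the reference signal itself.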

When you look at equation (12), you can see that the calculation depends strongly on preserving the exact sample alignment between the predicted and ground-truth audio. It can be difficult to tell from waveform plots, or even from listening, whether the samples are misaligned. However, a zoomed-in plot where you can see each individual sample can help you confirm that the ground-truth and predicted samples are exactly lined up. Even a shift of a single sample will keep the SDR calculation from reflecting the actual SDR.
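
For example, one way to check for a constant offset is to cross-correlate one channel of the prediction against the ground truth. This is only a sketch: it assumes SciPy is installed, the file names are placeholders for the arrays linked in the question, and estimate_lag is a hypothetical helper:

import numpy as np
from scipy import signal

def estimate_lag(reference, estimate):
    # Find the shift (in samples) that maximizes the cross-correlation;
    # 0 means the two signals are already aligned.
    corr = signal.correlate(estimate, reference, mode="full")
    lags = signal.correlation_lags(len(estimate), len(reference), mode="full")
    return lags[np.argmax(corr)]

actual = np.load("actual_vocals.npy")       # placeholder file names
predicted = np.load("predicted_vocals.npy")

print("estimated offset (samples):", estimate_lag(actual[:, 0], predicted[:, 0]))

# A non-zero offset means the prediction should be shifted (e.g. by slicing
# both arrays to their overlapping region) before calling eval_mus_track.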
