I'm working on a project to compare how similar someone's singing is to the original artist's. I'm mostly interested in the pitch of the voice, to see whether they're in tune.
The audio files are in .wav format and I've been able to load them with the wave module and convert them to NumPy arrays. Then I built a frequency vector and a time vector to plot the signal.
import wave
import numpy as np

# Load the recording (assumes a mono, 16-bit PCM .wav file)
raw_audio = wave.open("myAudio.WAV", "r")
audio = raw_audio.readframes(-1)
signal = np.frombuffer(audio, dtype='int16')
fs = raw_audio.getframerate()
timeDelta = 1/fs  # time between consecutive samples

# Get time and frequency vectors
start = 0
end = len(signal)*timeDelta
points = len(signal)
t = np.linspace(start, end, points, endpoint=False)  # sample times in seconds
f = np.linspace(0, fs, points, endpoint=False)       # FFT bin frequencies in Hz
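A minimal plotting sketch along these lines, using the vectors above (matplotlib assumed; the spectrum plot is just for a quick look at the frequency content):

import matplotlib.pyplot as plt

# Waveform: amplitude against the time vector t
plt.figure()
plt.plot(t, signal)
plt.xlabel("Time [s]")
plt.ylabel("Amplitude")

# Magnitude spectrum of the whole clip against the frequency vector f
plt.figure()
plt.plot(f, np.abs(np.fft.fft(signal)))
plt.xlabel("Frequency [Hz]")
plt.ylabel("Magnitude")

plt.show()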
If I have another signal of the same duration (both clips are roughly 5-10 seconds long), what would be the best way to compare the two signals for similarity?
I've thought of comparing the frequency domains and using autocorrelation, but I feel that both of those methods have a lot of drawbacks.
Comments (1)
I am faced with a similar problem of evaluating the similarity of two audio signals (one real, one generated by a machine-learning pipeline). I have signal parts where the comparison is very time-critical (the time difference between peaks representing the arrival of different early reflections), and for those I will try calculating the cross-correlation between the signals (more on that here: https://www.researchgate.net/post/how_to_measure_the_similarity_between_two_signal )
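For completeness, a minimal sketch of that cross-correlation idea, assuming two equal-length NumPy arrays signal_a and signal_b (placeholder names) sampled at the same rate, and SciPy >= 1.6 for correlation_lags:

import numpy as np
from scipy import signal as sps  # aliased so it doesn't clash with an audio array called `signal`

fs = 44100  # sampling rate of both recordings (assumption)

# signal_a and signal_b stand in for the two recordings as 1-D arrays
a = signal_a / np.max(np.abs(signal_a))  # peak-normalize so loudness differences
b = signal_b / np.max(np.abs(signal_b))  # don't dominate the correlation

corr = sps.correlate(a, b, mode="full")                   # full cross-correlation
lags = sps.correlation_lags(len(a), len(b), mode="full")  # lag value for each correlation sample
best_lag = lags[np.argmax(corr)]                          # lag of the strongest match
offset_seconds = best_lag / fs                            # relative time offset between the signals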
Since natural recordings of two different voices will be quite different in the time domain, this would probably not be ideal for your problem.
For signals where frequency information (like pitch and timbre) is of greater interest, I would work in the frequency domain. You can, for example, compute short-time FFTs (STFT) or a CQT (a more musical representation of the spectrum, since its bins are mapped to octaves) for the two signals and then compare them, for example by calculating the mean squared error (MSE) between corresponding time windows of the two signals. Before transforming, you should of course normalize the signals. STFT, CQT and normalization can easily be done and visualized with librosa.
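A minimal sketch of that preprocessing step with librosa (the file names are placeholders, and the n_fft and hop_length values are just reasonable defaults, not something from your question):

import librosa
import numpy as np

# librosa.load returns float32 audio in [-1, 1]; sr=None keeps the native sampling rate
y_ref, sr = librosa.load("original.wav", sr=None, mono=True)
y_cov, _ = librosa.load("cover.wav", sr=sr, mono=True)  # resample the cover to the same rate

# Peak-normalize both recordings before transforming
y_ref = librosa.util.normalize(y_ref)
y_cov = librosa.util.normalize(y_cov)

# Short-time Fourier transform (magnitudes only)
S_ref = np.abs(librosa.stft(y_ref, n_fft=2048, hop_length=512))
S_cov = np.abs(librosa.stft(y_cov, n_fft=2048, hop_length=512))

# Or a constant-Q transform, whose bins follow a musical (octave-based) scale
C_ref = np.abs(librosa.cqt(y_ref, sr=sr, hop_length=512))
C_cov = np.abs(librosa.cqt(y_cov, sr=sr, hop_length=512))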
Two things about this approach:
1. Don't make the time windows of your STFTs too short. Spectra of human voices start somewhere in the hundred-hertz range (https://av-info.eu/index.html?https&&&av-info.eu/audio/speech-level.html gives 350 Hz as the low end). So the number of samples in (or the length of) your STFT time windows should be at least window_length >= sampling_rate / lowest_frequency. So if your recordings have a 44100 Hz sampling frequency, your time window must be at least 44100 / 350 = 126 samples; make it 128, that's a nicer number (a short calculation sketch follows these two points). That way you guarantee that a sound wave with a fundamental frequency of 350 Hz can still be "seen" for at least one full period in a single window. Of course, bigger windows will give you a more exact spectral representation.
2. Before transforming, you should make sure that the two signals you are comparing represent the same sound events at the same time. So none of this works if the two singers didn't sing the same thing, or didn't sing at the same speed, or there are different background noises in the signals. Provided that you have dry recordings of only the voices and these voices sing the same thing at equal speed, you just need to make sure that the signal starts align. In general, you need to make sure that sound events (e.g. transients, silence, notes) align. When there is a long AAAH sound in one signal, there should also be a long AAAH sound in the other signal. You can make your evaluation somewhat more robust by increasing the STFT windows even further; this will reduce the time resolution (you will get fewer spectral representations of the signal), but more sound events are evaluated together in one time window.
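As a quick check of the window-length rule from the first point (the 350 Hz figure comes from the linked page; rounding up to a power of two is just the convenience mentioned above):

import math

fs = 44100    # sampling frequency of the recordings, in Hz
f_min = 350   # lowest fundamental frequency to resolve, per the linked page

min_window = math.ceil(fs / f_min)  # 126 samples: one full period of a 350 Hz wave
n_fft = 128                         # rounded up to the next power of two
print(min_window, n_fft)            # -> 126 128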
You could of course just generate one FFT for each signal over its entire length, but the results will be more meaningful if you generate STFTs or CQTs (or some other transform better suited to human hearing) over equal-length, short time windows and then calculate the MSE for each pair of time windows (the first window of signal 1 with the first window of signal 2, then the second pair, then the third, and so on).
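A sketch of that per-window comparison, assuming the two magnitude spectrograms S_ref and S_cov from the earlier snippet (placeholder names) and recordings that are already aligned:

import numpy as np

# Trim to the same number of frames in case the frame counts differ slightly
n_frames = min(S_ref.shape[1], S_cov.shape[1])
A = S_ref[:, :n_frames]
B = S_cov[:, :n_frames]

# Mean squared error per time window (one value per STFT/CQT frame)
mse_per_window = np.mean((A - B) ** 2, axis=0)

# A single overall score: lower means the spectra are more similar
overall_mse = float(mse_per_window.mean())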
Hope this helps.