How can I compute the similarity between two voice files of different lengths in Python?
I'd like to compare two voice files.
The first file (ref) and the comparison file (comp) are spoken by different people.
My hypothesis is that the more similar two recordings sound, the closer their pronunciation, intonation, and tone will be.
However, the two files have different lengths. Is it still possible to compare them?
!pip install librosa  # colab
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import librosa
import librosa.display

# Load both recordings (librosa resamples to 22050 Hz by default)
x_1, fs_1 = librosa.load('voice_tar.wav')
x_2, fs_2 = librosa.load('voice_comp.wav')
print('<Voice_tar>', 'audio shape:', x_1.shape, 'length:', x_1.shape[0]/float(fs_1), 'secs')
print('<Voice_comp>', 'audio shape:', x_2.shape, 'length:', x_2.shape[0]/float(fs_2), 'secs')

# Plot the two waveforms on shared axes
fig, ax = plt.subplots(nrows=2, sharex=True, sharey=True, figsize=(10, 3))
librosa.display.waveshow(x_1, sr=fs_1, ax=ax[0])
ax[0].set(title='Voice_tar')
ax[0].label_outer()
librosa.display.waveshow(x_2, sr=fs_2, ax=ax[1])
ax[1].set(title='Voice_comp')
plt.show()
The results are as follows.
<Voice_tar> audio shape: (43395,) length: 1.9680272108843537 secs
<Voice_comp> audio shape: (31673,) length: 1.4364172335600907 secs
[Image: waveform plots of the two voice files, Voice_tar and Voice_comp]
Also, how can I get a similarity score with librosa.segment.cross_similarity()?
Comments (1)
I am studying this problem too. It would be better to process the audio directly, but I don't have a solution for that yet, so you could try turning it into a computer-vision problem on images, like this:
- read the two audio files and write their features into two images (for example, spectrogram plots);
- calculate the similarity between those two images, using phash from https://pypi.org/project/ImageHash/ or https://github.com/thorn-oss/perception (a sketch follows below); I have also thought about histograms but haven't tried that yet.
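As a rough sketch of that image route (my own assumptions: the "feature image" is a log-mel spectrogram, and the helper save_spectrogram and the file names ref.png / comp.png are only for illustration), something like this might work:

!pip install librosa imagehash  # colab
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
import imagehash
from PIL import Image

def save_spectrogram(wav_path, png_path):
    # Render a log-mel spectrogram of one recording as an image file
    y, sr = librosa.load(wav_path)
    S = librosa.feature.melspectrogram(y=y, sr=sr)
    S_db = librosa.power_to_db(S, ref=np.max)
    fig, ax = plt.subplots(figsize=(4, 4))
    librosa.display.specshow(S_db, sr=sr, ax=ax)
    ax.set_axis_off()
    fig.savefig(png_path, bbox_inches='tight', pad_inches=0)
    plt.close(fig)

save_spectrogram('voice_tar.wav', 'ref.png')
save_spectrogram('voice_comp.wav', 'comp.png')

# Hamming distance between perceptual hashes: 0 means identical images,
# larger means more different; this is only a coarse comparison, since
# the two spectrograms cover different durations
h_ref = imagehash.phash(Image.open('ref.png'))
h_comp = imagehash.phash(Image.open('comp.png'))
print('phash distance:', h_ref - h_comp)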
Have you tried this question, "How do we check similarity between hash values of two audio files in Python?", and this: https://librosa.org/doc/main/generated/librosa.segment.cross_similarity.html#librosa.segment.cross_similarity ?
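On the cross_similarity question: here is a minimal sketch of what the call could look like, assuming chroma features (MFCCs would also work) and the file names from the question above. Note that it returns a frame-by-frame matrix rather than a single score, so the different lengths of the two files simply produce a rectangular matrix; to get one number you still have to summarize or align it (for example with librosa.sequence.dtw).

import librosa

# Load both recordings (librosa's default 22050 Hz sample rate)
x_ref, sr = librosa.load('voice_tar.wav')
x_comp, _ = librosa.load('voice_comp.wav')

# Frame-level features for each recording; the sequences have different lengths
hop_length = 512
chroma_ref = librosa.feature.chroma_cqt(y=x_ref, sr=sr, hop_length=hop_length)
chroma_comp = librosa.feature.chroma_cqt(y=x_comp, sr=sr, hop_length=hop_length)

# Cross-similarity between the two feature sequences;
# mode='affinity' gives continuous similarity values instead of a binary matrix
xsim = librosa.segment.cross_similarity(chroma_comp, chroma_ref, mode='affinity')
print('matrix shape:', xsim.shape)    # one axis per recording's frames
print('mean affinity:', xsim.mean())  # a very crude scalar summary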