In a Python for loop, I can't get the expected counter value
This is a speaker recognition project using the Resemblyzer module. The folder structure is as follows: there is a directory speakers, which contains folders spk1, spk2, etc. Each folder contains several recordings of a single speaker (one folder per speaker). During processing, I built a dictionary whose keys are the same directory names (spk1, spk2, etc.) and whose values are the embeddings (representations) of that speaker's recordings.
Next, I want to compare the recordings of each speaker in pairs to calculate the accuracy metric (how often the system makes mistakes). In the first stage of the script below, I create pairwise combinations so that I go through every possible pair of recordings of a speaker within that speaker's own folder.
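The pairing step described above can be sketched in isolation. This is only an illustration: the dictionary `speaker_recordings`, its string values, and the helper `within_speaker_pairs` are hypothetical stand-ins, not part of the original script.

```python
from itertools import combinations

# Hypothetical stand-in for the real dictionary: speaker name -> list of recordings.
speaker_recordings = {
    "spk1": ["rec_a", "rec_b", "rec_c"],
    "spk2": ["rec_d", "rec_e"],
}

def within_speaker_pairs(recordings_by_speaker):
    """Return every unordered pair of recordings that belong to the same speaker."""
    pairs = []
    for speaker, recordings in recordings_by_speaker.items():
        # combinations(..., 2) yields each pair once, never (x, x) and never both orders
        for first, second in combinations(recordings, 2):
            pairs.append((speaker, first, second))
    return pairs

pairs = within_speaker_pairs(speaker_recordings)
print(len(pairs))
```

A speaker with n recordings contributes n(n-1)/2 pairs, so the total number of comparisons is the sum of that quantity over all speakers.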
The second stage is to put the embeddings of the recordings into similarity matrices and compare them by means of cosine similarity.
In the final (third) stage we compute the accuracy. We went through 46 combinations, but for some reason we got 0 matches, although if you print out the similarity matrices it is obvious that there are matches. What's wrong with the for loop?
Previously, a similar problem occurred when I solved the same task using the speechbrain library. In that case the counting error was caused by a tensor data type that yields the logical values True or False. Here, it seems to me, the case is different.
Code:
!pip install resemblyzer
!pip install umap
import numpy as np
from itertools import combinations

# Note: `encoder` and `speaker_wavs` are created earlier in the notebook (not shown here).
num_true = 0
num_total = 0
# Stage 1 - for the sake of comparison, we go through the dictionary values (i.e. the speakers' records)
# and create a list of all possible combinations:
# (speaker 1 record 1 - speaker 1 record 2), (speaker 1 record 1 - speaker 1 record 3), etc.
for elems in speaker_wavs.values():
    # print(elems[0])
    tuples = list(combinations(elems, 2))  # all pairwise combinations for this speaker
    # Stage 2 - create embeddings of records
    for single in tuples:  # we go through each combination in the list
        # the .embed_utterance() function creates voice embeddings
        embeds = (np.array([encoder.embed_utterance(single[0])]),
                  np.array([encoder.embed_utterance(single[1])]))
        num_total += 1
        # Let's calculate the similarity matrix. The similarity of two embeddings is simply their dot product,
        # because the similarity metric is cosine similarity, and the embeddings are already L2-normalized.
        # Short version:
        utt_sim_matrix = np.inner(embeds[0], embeds[1])  # the inner product of two arrays
        # print('Matrix_1', utt_sim_matrix)  # print it out if you need to visually compare embeddings
        # Long, detailed version:
        utt_sim_matrix2 = np.zeros((len(embeds[0]), len(embeds[1])))
        for i in range(len(embeds[0])):
            for j in range(len(embeds[1])):
                # The @ notation is equivalent to np.dot(embeds[0][i], embeds[1][j])
                utt_sim_matrix2[i, j] = embeds[0][i] @ embeds[1][j]
        # print('Matrix_2', utt_sim_matrix2)  # print it out if you need to visually compare embeddings
        # np.allclose returns True if two arrays are element-wise equal within a tolerance
        if np.allclose(utt_sim_matrix, utt_sim_matrix2) == 'True':
            num_true += 1
# print(num_true)   # now we get 0
# print(num_total)  # now we get 46
# Stage 3 - counting the accuracy metric:
if num_total != 0:
    accuracy = num_true / num_total
    print(accuracy)
else:
    print("You can't divide by zero")
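The comment in the script that the dot product of L2-normalized embeddings equals their cosine similarity can be checked with a small sketch. Random vectors stand in for real voice embeddings here, and the dimension of 256 and the helper names are arbitrary choices made for illustration only.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    """Cosine similarity computed from its definition, without assuming unit length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumption: random vectors play the role of two utterance embeddings.
rng = np.random.default_rng(0)
emb_a = l2_normalize(rng.normal(size=256))
emb_b = l2_normalize(rng.normal(size=256))

cos = cosine_similarity(emb_a, emb_b)

# Same layout as in the script: each embedding is wrapped in a 1-row array,
# so np.inner produces a 1x1 similarity matrix.
sim_matrix = np.inner(emb_a[np.newaxis, :], emb_b[np.newaxis, :])

print(sim_matrix.shape)
print(np.isclose(sim_matrix[0, 0], cos))
```

Because both vectors have unit norm, the denominator in the cosine formula is 1, which is why the plain inner product is enough.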
Comments (1)
Solved the problem like this:
That is, I explicitly converted the result of the check to a string type before comparing it with 'True'.
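The code for this fix is not shown on the page, so the following is only a sketch of what the sentence describes, not the author's exact snippet. It assumes the root cause is the comparison `np.allclose(...) == 'True'` from the question: `np.allclose` returns a boolean, and a boolean never equals the string `'True'`, so the counter was never incremented. The small arrays are made-up values used for illustration.

```python
import numpy as np

# Made-up 1x1 "similarity matrices" that are equal within tolerance.
matrix_a = np.array([[0.75]])
matrix_b = np.array([[0.75]])

result = np.allclose(matrix_a, matrix_b)  # a boolean, not a string

# Comparison from the question: a bool is never equal to the str 'True'.
broken = (result == 'True')

# Workaround described in the comment: convert to str first, then compare.
workaround = (str(result) == 'True')

# Simpler alternative: use the boolean directly as the condition.
num_true = 0
if result:
    num_true += 1

print(broken, workaround, num_true)
```

Converting to a string works, but testing the boolean directly avoids the string comparison altogether and is the more common idiom.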