Python: finding cosine similarity between two groups of vectors in the most efficient way

Posted on 2025-01-31 02:04:01

I am trying to calculate the average cosine similarity between two groups of sentences. After converting the sentences to embeddings, I need to calculate the average similarity in the most efficient way. Below is what I have tried, along with how the run times compare. Is there any way to speed up this calculation?

I have also included the slower approaches to help you understand the case.

X_num is a pandas.Series of embedding vectors; each row is generally 512, 768, 1024 or 2048 long.

class1_indexes : all the indexes of the instances that belong to class 1.
class2_indexes : all the indexes of the instances that belong to class 2.

I need to calculate the cosine similarity between every vector pair from class 1 and class 2. In total, my output should be a vector of cosine similarities of length len(class1_indexes)*len(class2_indexes).
I edited the code so that it includes a test case. The run times of the approaches compare as follows:

t1 > t2 > t3

The third approach is about 20 times faster, but I'm looking for something much faster still.

Thanks in advance.

import numpy as np
import pandas as pd

sample_in_each_class = 1000
X_num = pd.Series([np.random.random(10) for i in range(2*sample_in_each_class)])
class1_indexes = list(range(sample_in_each_class))
class2_indexes = list(range(sample_in_each_class,2*sample_in_each_class))

approach 1

def cosine_similarity(vec1, vec2):
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    if norm1 == 0:
        norm1 += 0.00001
    if norm2 == 0:
        norm2 += 0.00001  
    return np.dot(vec1, vec2)/(norm1*norm2)
    
approach1 = []
for idx1 in class1_indexes:
    for idx2 in class2_indexes:
        approach1.append(cosine_similarity(X_num.loc[idx1], X_num.loc[idx2]))

approach 2

import itertools
vectors_product = itertools.product(X_num[class1_indexes], X_num[class2_indexes])
vectors_product = pd.Series(list(vectors_product))
approach2 = vectors_product.apply(lambda x: cosine_similarity(x[0], x[1]))

approach 3

vectors_product = itertools.product(X_num[class1_indexes], X_num[class2_indexes])
vectors_product = np.array(list(vectors_product))
first_part = vectors_product[:,0,:]
second_part = vectors_product[:,1,:]

numerator = np.sum(np.multiply(first_part, second_part), axis=1)
denominator = (np.multiply(np.linalg.norm(first_part, axis=1), 
                           np.linalg.norm(second_part, axis=1)))
approach3 = numerator / denominator
                  

Comments (1)

焚却相思 2025-02-07 02:04:02

I think the most efficient way is:

  1. Convert your data to two numpy ndarrays, an array for each group. I assume rows are embedding vectors and arrays are called X1 and X2.
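  2. Do X/np.linalg.norm(X, axis=1, keepdims=True) on each array, so that every row has unit length (keepdims=True is needed for the division to broadcast row by row).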
  3. Then do np.dot(X1, X2.T)
  4. Then flatten the resulting matrix if you want.
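
The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the names pairwise_cosine, mean_cosine and eps are made up for this example, the random arrays stand in for the real embeddings, and the eps guard for all-zero rows is my own addition in the spirit of the 0.00001 fix in the question. With the question's data, X1 could presumably be built with something like np.stack(X_num[class1_indexes].to_numpy()), and X2 likewise.

```python
import numpy as np

def pairwise_cosine(X1, X2, eps=1e-12):
    """Return the (len(X1), len(X2)) matrix of cosine similarities between rows."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    # keepdims=True keeps the norms as column vectors, so the division
    # below is applied row by row.
    n1 = np.linalg.norm(X1, axis=1, keepdims=True)
    n2 = np.linalg.norm(X2, axis=1, keepdims=True)
    # Assumption: clamp the norms so all-zero rows give similarity 0
    # instead of a division by zero.
    U1 = X1 / np.maximum(n1, eps)
    U2 = X2 / np.maximum(n2, eps)
    # One matrix product computes every pairwise dot product of unit rows.
    return U1 @ U2.T

def mean_cosine(X1, X2, eps=1e-12):
    """Average cosine similarity over all pairs, without forming the full matrix."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    U1 = X1 / np.maximum(np.linalg.norm(X1, axis=1, keepdims=True), eps)
    U2 = X2 / np.maximum(np.linalg.norm(X2, axis=1, keepdims=True), eps)
    # The mean of all u_i . v_j equals (mean of u_i) . (mean of v_j).
    return float(U1.mean(axis=0) @ U2.mean(axis=0))

# Stand-in data with the same shape as the question's test case.
rng = np.random.default_rng(0)
X1 = rng.random((1000, 10))
X2 = rng.random((1000, 10))

sims = pairwise_cosine(X1, X2)   # shape (1000, 1000)
flat = sims.ravel()              # class 1 index varies slowest, as in the nested loop
avg = flat.mean()
```

If only the average is needed, mean_cosine avoids allocating the full len(X1) by len(X2) matrix, which may matter for large groups. Actual speedups depend on the data size and the BLAS library NumPy is linked against, so it is worth timing on the real embeddings.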