Calculate cosine similarity and output the results without duplicates?

Posted 2025-01-10 12:31:05, length 589, views 2, comments 0

I have the following vectors in my toy example:

import pandas as pd

data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1],
})

I have calculated the similarity matrix (excluding the id column).

from sklearn.metrics.pairwise import cosine_similarity

# Calculate the pairwise cosine similarities 
S = cosine_similarity(data.drop('id', axis=1))

# Wrap the NumPy array in a DataFrame directly; no list round-trip is needed
df = pd.DataFrame(S)

This returns a matrix/DataFrame with every pairwise combination, including self-similarities and duplicates.
Is there an efficient way to calculate the similarities without self-similarities (a vector is 100% similar to itself) and without duplicates (if vectors 1 and 2 have 89% similarity, I don't need the similarity of vectors 2 and 1, since it is the same)?

眼藏柔 2025-01-17 12:31:05

The best solution I have found so far is to take the upper triangle above the main diagonal (S is symmetric, so the lower triangle holds the same values):

[In] S[np.triu_indices_from(S, k=1)]

[Out] array([ 0.93420158, -0.93416293,  0.99856978, -0.81303909, -0.99999999,
    0.91379242, -0.96724292, -0.91374841,  0.96727042, -0.78074903])

This takes only the values strictly above the main diagonal (k=1 skips the diagonal itself), so it excludes both the ones on the diagonal (the self-similarities) and the repeated values. The result is a flat NumPy array, ordered row by row: pairs (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), and so on.
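If you also need to know which pair each value belongs to, the same index arrays can be reused to label the results. The following is only a sketch built on the question's toy data; the column names id_1, id_2 and similarity are made up for illustration and are not part of the original post.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1],
})

# Full symmetric similarity matrix, as in the question
S = cosine_similarity(data.drop('id', axis=1))

# Row and column positions of the entries strictly above the main diagonal
rows, cols = np.triu_indices_from(S, k=1)

# One row per unordered pair; the column names are illustrative assumptions
ids = data['id'].to_numpy()
pairs = pd.DataFrame({
    'id_1': ids[rows],
    'id_2': ids[cols],
    'similarity': S[rows, cols],
})
print(pairs)
```

For n vectors this gives n(n-1)/2 rows, so 10 rows for the five vectors here. Note that the full matrix is still computed first; the indexing only filters what is kept.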
