Computing similarity with the Jaccard index in Python

Posted 2025-01-14 05:22:38

I want to use the Jaccard index to find the similarity between the rows of my DataFrame (built from user_choices).

import scipy.spatial
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

user_choices = [[1, 0, 0, 1, 0, 1], 
                [0, 1, 0, 0, 0, 0], 
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1]]
df_choices = pd.DataFrame(user_choices, columns=["User A", "User B", "User C", "User D", "User E", "User F"], 
                          index=(["User A", "User B", "User C", "User D", "User E", "User F"]))

df_choices

I wrote this code to calculate a Jaccard Index for my data:

jaccard = (1-scipy.spatial.distance.cdist(df_choices, df_choices,  
                                       metric='jaccard'))
user_distance = pd.DataFrame(jaccard, columns=df_choices.index.values,  
                             index=df_choices.index.values)

user_distance

But the output is identical to my input data!

玉环 2025-01-21 05:22:38

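If I understand correctly, you want user_distance[i, j] to be computed from the Jaccard distance between rows i and j of df_choices (here as 1 - distance, i.e. the Jaccard similarity).

You can get this in two steps: (1) compute the pairwise distances with pdist, which returns a condensed vector holding one distance per pair of rows i < j; (2) expand that condensed vector into a square matrix with squareform.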

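# Condensed vector of pairwise Jaccard distances between the rows of df_choices
jaccard = scipy.spatial.distance.pdist(df_choices, 'jaccard')
# Expand to a square matrix and turn distance into similarity (1 - distance)
user_distances = pd.DataFrame(1 - scipy.spatial.distance.squareform(jaccard),
                              columns=df_choices.index.values,
                              index=df_choices.index.values)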

Your input happens to be a symmetric matrix, and the distance matrix is always symmetric, so it is possible for the two to coincide.

Any two rows of your matrix are either identical or have no 1s in common, so every Jaccard distance is exactly 0 or 1, and the output matrix contains only ones and zeros.
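
The claim above can be checked by hand. The following is a minimal sketch, not part of the original answer: `jaccard_similarity` is a hypothetical helper that applies the set definition (size of the intersection divided by size of the union) directly to boolean rows, and treating two all-zero rows as identical is an assumption made only to avoid a division by zero.

```python
import numpy as np

# The question's data, stored as booleans because the Jaccard index is defined on sets.
user_choices = np.array([
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1],
], dtype=bool)


def jaccard_similarity(u, v):
    """Return |u AND v| / |u OR v| for two boolean vectors (hypothetical helper)."""
    union = np.logical_or(u, v).sum()
    if union == 0:
        return 1.0  # assumption: two all-zero rows count as identical
    return np.logical_and(u, v).sum() / union


n = len(user_choices)
similarity = np.array([[jaccard_similarity(user_choices[i], user_choices[j])
                        for j in range(n)] for i in range(n)])

# Every pair of rows is either identical (similarity 1) or shares no 1s
# (similarity 0), so the similarity matrix reproduces the input.
print(np.array_equal(similarity, user_choices.astype(float)))
```

For this particular input the check should print True, which is why the result looks like a copy of the data.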

If you run the pdist/squareform code above on the following example:

user_choices = [[1, 0, 0, 3, 0, 4], 
                [0, 1, 0, 0, 0, 0], 
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1]]

you will get an output that differs from the input.
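
The example above contains non-binary values (3 and 4). Since the Jaccard index is defined on sets, and SciPy's handling of non-boolean input may differ between versions, a 0/1 example with partially overlapping rows may illustrate the same point more safely. The sketch below uses made-up data (the `choices` frame and its item names are hypothetical) and casts to bool explicitly before calling pdist.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: some rows share only part of their 1s.
choices = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]],
    index=["User A", "User B", "User C", "User D"],
    columns=["Item 1", "Item 2", "Item 3", "Item 4"],
)

# Pairwise Jaccard distances between rows, expanded to a square matrix.
distances = squareform(pdist(choices.to_numpy(dtype=bool), metric="jaccard"))

# Similarity is the complement of the distance; the diagonal becomes 1.
similarity = pd.DataFrame(1 - distances, index=choices.index, columns=choices.index)
print(similarity.round(2))
```

Here User A and User B share one of two selected items, and User A and User C share one of three, so the result contains fractional similarities and no longer resembles the input.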

╰◇生如夏花灿烂 2025-01-21 05:22:38

The Jaccard distance from, e.g., user F with row vector (1, 0, 0, 1, 0, 1) to user A is zero, so you compute 1 - 0 = 1.
The Jaccard distance from, e.g., user E with row vector (0, 0, 0, 0, 1, 0) to user A is one, so you compute 1 - 1 = 0.

>>> print(scipy.spatial.distance.jaccard(user_choices[0], user_choices[5]))
0.0
>>> print(scipy.spatial.distance.jaccard(user_choices[0], user_choices[4]))
1.0

You have perhaps accidentally arrived at an input that is identical to one minus its own Jaccard distance matrix, i.e. to its own Jaccard similarity matrix.

Maybe you don't want that (1-...) there?
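
To make the difference concrete, here is a small sketch (an illustration, not the asker's exact code) that computes both matrices with cdist on the question's data, cast to bool so the result does not depend on how non-boolean input is treated.

```python
import numpy as np
from scipy.spatial.distance import cdist

user_choices = np.array([
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1],
], dtype=bool)

# Jaccard distance between every pair of rows (0 = identical, 1 = nothing shared).
distance = cdist(user_choices, user_choices, metric="jaccard")

# The question's "1 - ..." turns that distance into a similarity.
similarity = 1 - distance

# For this input the similarity matrix equals the data itself ...
print(np.array_equal(similarity, user_choices.astype(float)))
# ... while the plain distance matrix is its 0/1 complement.
print(np.array_equal(distance, (~user_choices).astype(float)))
```

Both checks should print True for this input. Keep the 1 - ... if you want similarities and drop it if you want distances.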
