Calculating similarity with the Jaccard index in Python
I want to use the Jaccard index to measure the similarity between the rows of a DataFrame (df_choices, built from user_choices).
import scipy.spatial
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
user_choices = [[1, 0, 0, 1, 0, 1],
                [0, 1, 0, 0, 0, 0],
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1]]
df_choices = pd.DataFrame(user_choices,
                          columns=["User A", "User B", "User C", "User D", "User E", "User F"],
                          index=["User A", "User B", "User C", "User D", "User E", "User F"])
df_choices
I wrote this code to calculate a Jaccard Index for my data:
jaccard = 1 - scipy.spatial.distance.cdist(df_choices, df_choices,
                                           metric='jaccard')
user_distance = pd.DataFrame(jaccard, columns=df_choices.index.values,
                             index=df_choices.index.values)
user_distance
But the output is identical to my input data!
Comments (2)
If I understand correctly, you want:
user_distance[i,j] = jaccard-distance(df_choices[i], df_choices[j])
You can get this in two steps: (1) compute the pairwise distances, which gives one distance per pair of rows in condensed form; (2) convert the condensed distance matrix to square form.
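The two steps can be sketched as follows. This assumes the answer is referring to SciPy's `pdist` and `squareform` (the answer does not name them), and it casts the data to boolean, since Jaccard distance is defined on boolean vectors.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

users = ["User A", "User B", "User C", "User D", "User E", "User F"]
user_choices = [[1, 0, 0, 1, 0, 1],
                [0, 1, 0, 0, 0, 0],
                [0, 0, 1, 0, 0, 0],
                [1, 0, 0, 1, 0, 1],
                [0, 0, 0, 0, 1, 0],
                [1, 0, 0, 1, 0, 1]]
df_choices = pd.DataFrame(user_choices, index=users, columns=users)

# Step 1: condensed vector of Jaccard distances, one entry per pair of rows (i < j).
condensed = pdist(df_choices.to_numpy().astype(bool), metric="jaccard")

# Step 2: expand the condensed vector into a full square, symmetric matrix.
user_distance = pd.DataFrame(squareform(condensed), index=users, columns=users)
print(user_distance)
```

Note that this yields distances, with zeros on the diagonal; subtract the result from 1 if a similarity is wanted instead.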
Your input matrix happens to be symmetric, and a pairwise distance matrix is always symmetric.
Any two rows in your matrix are either identical or have no 1 in a common position, so every distance is exactly 0 or 1 and the output matrix contains only ones and zeros.
If you try the same code on data whose rows overlap only partially, the output will differ from the input.
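To illustrate, here is a sketch with a made-up matrix (hypothetical data, not the example from the original answer) whose rows overlap only partially. The same `1 - cdist(...)` expression then produces fractional similarities that no longer match the input.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical choices: each row shares some, but not all, of its 1s with other rows.
choices = np.array([[1, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 1, 1],
                    [1, 1, 1, 0]], dtype=bool)

# Jaccard similarity = 1 - Jaccard distance, for every pair of rows.
similarity = 1 - cdist(choices, choices, metric="jaccard")
print(np.round(similarity, 3))
```

For instance, rows 0 and 1 share one chosen position out of three chosen in total, so `similarity[0, 1]` is 1/3.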
The Jaccard distance from, e.g., user F, with row vector (1, 0, 0, 1, 0, 1), to user A is zero, so you compute 1 - scipy.spatial.distance.cdist(...) = 1 - 0 = 1.
The Jaccard distance from, e.g., user E, with row vector (0, 0, 0, 0, 1, 0), to user A is one, so you compute 1 - 1 = 0.
You have perhaps accidentally arrived at an input that is identical to one minus its own Jaccard distance matrix.
Maybe you don't want that (1 - ...) there?
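The numbers above can be checked with a short sketch on the question's data. It uses NumPy arrays rather than the DataFrame for brevity and casts to boolean, which is an assumption about how the data should be interpreted.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Rows are users A..F, as in the question.
user_choices = np.array([[1, 0, 0, 1, 0, 1],
                         [0, 1, 0, 0, 0, 0],
                         [0, 0, 1, 0, 0, 0],
                         [1, 0, 0, 1, 0, 1],
                         [0, 0, 0, 0, 1, 0],
                         [1, 0, 0, 1, 0, 1]], dtype=bool)

distance = cdist(user_choices, user_choices, metric="jaccard")
similarity = 1 - distance

print(distance[5, 0])  # user F to user A: 0.0 (identical rows)
print(distance[4, 0])  # user E to user A: 1.0 (no shared 1s)

# The similarity matrix coincides with the input for this particular data set.
print(np.allclose(similarity, user_choices.astype(float)))  # True
```

So the code in the question is working: `similarity` is a valid Jaccard similarity matrix that happens to equal the input, while `distance` (without the `1 - ...`) is its complement.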