Getting Jaccard similarity by comparing all rows in a Pandas dataframe while keeping track of the rows being compared

Posted on 2025-01-17 07:22:29

Hi, I would like to get the Jaccard similarity between all rows in a dataframe.

I already have a Jaccard similarity function like the one below, which takes two lists, but I can't work out how to keep track of the users for whom the comparison is being done.

def jaccard_similarity(x,y):
  """ returns the jaccard similarity between two lists """
  intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
  union_cardinality = len(set.union(*[set(x), set(y)]))
  return intersection_cardinality/float(union_cardinality)

I would like to run this function against all the rows in the dataframe.

wordings                          users
apple,banana,orange,pears         adeline
banana,jackfruit,berries,apple    ericko
berries,grapes,watermelon         mary

How can I generate output like the following (the values are only illustrative), so that I can keep track of which users are being compared?

user1      user2     similarity
adeline    ericko    0.5
adeline    mary      0.2

Thank you very much for your guidance.

sentences = ['apple,banana,orange,pears', 'banana,jackfruit,berries,apple']
sentences = [sent.lower().split(",") for sent in sentences]
jaccard_similarity(sentences[0], sentences[1])

Output: 0.3333333333333333

Running the above code gives me the value I want for a single pair, but I am stuck on how to keep track of the users being compared if the dataframe had, say, 100 rows.

Thanks

Comments (1)

埋情葬爱 2025-01-24 07:22:29

A possible solution is the following:

import itertools
import pandas as pd

# copied from the question above
def jaccard_similarity(x, y):
    """ returns the jaccard similarity between two lists """
    intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
    union_cardinality = len(set.union(*[set(x), set(y)]))
    return intersection_cardinality/float(union_cardinality)

# set initial data and create dataframe
data = {
    "wordings": ["apple,banana,orange,pears", "banana,jackfruit,berries,apple", "berries,grapes,watermelon"],
    "users": ["adeline", "ericko", "mary"],
}
df = pd.DataFrame(data)

# split each comma-separated string into a list of words; passing the raw
# strings to jaccard_similarity would compare single characters, not words
word_lists = [wording.lower().split(",") for wording in df["wordings"]]

# create list of tuples like [(words, user), (words, user)]
wordings_users = list(zip(word_lists, df["users"]))

result = []

# loop through every unordered pair of (words, user) tuples
for (words1, user1), (words2, user2) in itertools.combinations(wordings_users, 2):
    similarity = jaccard_similarity(words1, words2)
    result.append({"user1": user1, "user2": user2, "similarity": similarity})

df1 = pd.DataFrame(result)
df1

This returns a DataFrame with the columns user1, user2 and similarity, one row per pair of users.
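For a larger dataframe, the same idea could be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name `pairwise_jaccard` and its parameters `text_col`, `user_col` and `sep` are hypothetical, the words are lowercased and stripped of surrounding whitespace before comparison, and two empty word sets are treated as having similarity 0.0 to avoid a division by zero.

```python
import itertools

import pandas as pd


def jaccard_similarity(x, y):
    """Jaccard similarity between two iterables of hashable items."""
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    if not union:
        # Assumption: two empty inputs are given a similarity of 0.0
        return 0.0
    return len(set_x & set_y) / len(union)


def pairwise_jaccard(df, text_col="wordings", user_col="users", sep=","):
    """Hypothetical helper: one output row per unordered pair of users."""
    # Turn each separated string into a set of lowercase, stripped words
    word_sets = [
        {word.strip().lower() for word in text.split(sep)}
        for text in df[text_col]
    ]
    users = df[user_col].tolist()

    rows = []
    # combinations(range(n), 2) yields every index pair (i, j) with i < j
    for i, j in itertools.combinations(range(len(users)), 2):
        rows.append({
            "user1": users[i],
            "user2": users[j],
            "similarity": jaccard_similarity(word_sets[i], word_sets[j]),
        })
    return pd.DataFrame(rows, columns=["user1", "user2", "similarity"])


df = pd.DataFrame({
    "wordings": [
        "apple,banana,orange,pears",
        "banana,jackfruit,berries,apple",
        "berries,grapes,watermelon",
    ],
    "users": ["adeline", "ericko", "mary"],
})

pairs = pairwise_jaccard(df)
print(pairs)
```

Because each unordered pair is visited once, 100 rows would produce 100 * 99 / 2 = 4950 output rows. Something like `pairs.sort_values("similarity", ascending=False)` would then list the most similar users first.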
