Getting the Jaccard similarity between all rows of a Pandas dataframe while keeping track of the rows being compared
Hi, I would like to get the Jaccard similarity between all rows in a dataframe.
I already have a Jaccard similarity function like the following, which takes two lists, but I can't work out how to keep track of the users being compared.

    def jaccard_similarity(x, y):
        """Return the Jaccard similarity between two lists."""
        # Size of the intersection divided by the size of the union.
        intersection_cardinality = len(set(x) & set(y))
        union_cardinality = len(set(x) | set(y))
        return intersection_cardinality / float(union_cardinality)

I would like to run this function against all the rows in the dataframe.

| wordings | users |
|---|---|
| apple,banana,orange,pears | adeline |
| banana,jackfruit,berries,apple | ericko |
| berries,grapes,watermelon | mary |

How can I generate output like the table below, so that I can keep track of the users being compared?

| user1 | user2 | similarity |
|---|---|---|
| adeline | ericko | 0.5 |
| adeline | mary | 0.2 |

Thank you very much for your guidance.

    sentences = ['apple,banana,orange,pears', 'banana,jackfruit,berries,apple']
    sentences = [sent.lower().split(",") for sent in sentences]
    jaccard_similarity(sentences[0], sentences[1])

Output: 0.3333333333333333
Running the above code gives me the value I want, but I am stuck on how to keep track of the users being compared if the dataframe has 100 rows.
Thanks
A possible solution is the following:
Returns
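As a minimal sketch of one way to do this (not necessarily the answerer's exact code), each unordered pair of rows can be enumerated with `itertools.combinations`, reading the user names from the same row positions as the word lists. It assumes the columns are named `wordings` and `users` as in the question. The names `words`, `rows` and `pairs` are illustrative, and returning 0.0 when both lists are empty is an added assumption to avoid dividing by zero.

```python
from itertools import combinations

import pandas as pd


def jaccard_similarity(x, y):
    """Return the Jaccard similarity between two iterables of items."""
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    if not union:  # both inputs empty: treat similarity as 0.0 (an assumption)
        return 0.0
    return len(set_x & set_y) / len(union)


# Sample data copied from the question.
df = pd.DataFrame({
    "wordings": [
        "apple,banana,orange,pears",
        "banana,jackfruit,berries,apple",
        "berries,grapes,watermelon",
    ],
    "users": ["adeline", "ericko", "mary"],
})

# Split each comma-separated string into a list of lowercase words.
words = df["wordings"].str.lower().str.split(",")

# combinations() yields each unordered pair of row positions exactly once,
# so every similarity stays attached to the two users it was computed for.
rows = [
    (
        df["users"].iloc[i],
        df["users"].iloc[j],
        jaccard_similarity(words.iloc[i], words.iloc[j]),
    )
    for i, j in combinations(range(len(df)), 2)
]

pairs = pd.DataFrame(rows, columns=["user1", "user2", "similarity"])
print(pairs)
```

With the three sample rows this produces three pairs: adeline/ericko share 2 of 6 distinct words (1/3), adeline/mary share none (0.0), and ericko/mary share 1 of 6 (1/6). For 100 rows the same loop produces 100 × 99 / 2 = 4950 pairs.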