How to get the intersection of two text columns in a pandas DF
I have a df that looks like this:
id  textcol1             textcol2             ...  coln
1   blue bowl            green bowl           ...  xxx
2   purple sheet         green grass          ...  xxx
3   ground black pepper  ground black pepper  ...  xxx
and so on...
I want to get the percentage of common words between textcol1 and textcol2
id  textcol1             textcol2             ...  coln  intersection
1   blue bowl            green bowl           ...  xxx   50
2   purple sheet         green grass          ...  xxx   0
3   ground black pepper  ground black pepper  ...  xxx   100
After an embarrassingly long time I've come up with the following solution
df['intersection'] = [(len(set(a) & set(b)) / float(len(set(a) | set(b))) * 100) for a, b in zip(df.textcol1, df.textcol2)]
But the results are not what I would expect, for example passing "ground black pepper" twice yields 93.33333333333330.
I've gone through all the usual cleaning steps - removing whitespace, etc. - but can't figure out what the issue is here.
What am I missing?
Comments (3)
Consider writing a generic text comparison function first, something like `text_diff`, that computes a simple token overlap between two texts (that is, between two sets of tokens). Then take the two columns you want to compare, turn each one into a column of token sets, and apply the function row by row.
That way you can easily tweak and/or replace your `text_diff` function. Do some research on text similarity measures, too, and use existing tools if applicable. fuzzywuzzy could be worth a shot, too.
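The code that originally accompanied this answer did not survive on this page, so the following is only a minimal sketch of the approach it describes. The body of `text_diff` is an assumption: it measures the shared tokens relative to the larger of the two token sets, which happens to reproduce the 50 / 0 / 100 values the question expects. The lowercasing and whitespace tokenization are also illustrative choices.

```python
import pandas as pd


def text_diff(tokens_a, tokens_b):
    """Percentage of shared tokens, relative to the larger token set (assumed metric)."""
    if not tokens_a or not tokens_b:
        return 0.0
    shared = tokens_a & tokens_b
    return 100.0 * len(shared) / max(len(tokens_a), len(tokens_b))


df = pd.DataFrame({
    "id": [1, 2, 3],
    "textcol1": ["blue bowl", "purple sheet", "ground black pepper"],
    "textcol2": ["green bowl", "green grass", "ground black pepper"],
})

# Turn each text column into a column of token sets (split on whitespace).
tokens1 = df["textcol1"].str.lower().str.split().apply(set)
tokens2 = df["textcol2"].str.lower().str.split().apply(set)

# Apply the comparison function row by row.
df["intersection"] = [text_diff(a, b) for a, b in zip(tokens1, tokens2)]
print(df)
```

Because the metric lives entirely inside `text_diff`, swapping in a different similarity measure only means changing that one function.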
I think the other answers are good, but you want the percentage of common words between `textcol1` and `textcol2` within a row. To obtain this, we have to retrieve all tokens from the row and count the word tokens that occur in both `textcol1` and `textcol2`.
The share of common words in the first row must be 0.33 (about 33%), because we compare against the set words = {bowl, blue, green}. `textcol1` and `textcol2` have only one word in common, common_words = {bowl}. As a result we get:
#common_words / #all_words = 1 / 3 = 0.33
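The example and output from this answer were lost on this page. As a stand-in, here is a small sketch of the ratio described above (shared words divided by all distinct words in the row). The function name `common_word_ratio` and the plain whitespace split are assumptions made for illustration.

```python
import pandas as pd


def common_word_ratio(text_a, text_b):
    """#common_words / #all_words for two strings, using whitespace-separated words."""
    words_a = set(text_a.split())
    words_b = set(text_b.split())
    all_words = words_a | words_b
    if not all_words:
        return 0.0  # both texts empty: define the ratio as 0
    return len(words_a & words_b) / len(all_words)


df = pd.DataFrame({
    "id": [1, 2, 3],
    "textcol1": ["blue bowl", "purple sheet", "ground black pepper"],
    "textcol2": ["green bowl", "green grass", "ground black pepper"],
})

df["intersection"] = [
    common_word_ratio(a, b) for a, b in zip(df["textcol1"], df["textcol2"])
]
print(df["intersection"].round(2).tolist())  # [0.33, 0.0, 1.0]
```

Note that this definition gives 0.33 for the first row, not the 50 shown in the question, so which denominator to use depends on what "percentage of common words" is meant to express.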
Here's a quick and dirty way. It might need to be adjusted based on the text and on how you define an intersection, as not a robot points out.
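The code for this answer is also missing here, so what follows is a guess at the kind of minimal change it implies, not the answerer's original. Calling `set()` on a string yields its characters, not its words, so the one-liner in the question compares character sets. The value 93.33 equals 14/15, which is what a character-level comparison gives if one copy of "ground black pepper" carries a single extra distinct character. The non-breaking space used below is purely hypothetical, picked to reproduce that number.

```python
def char_overlap(a, b):
    """What the question's expression computes: overlap of character sets, in percent."""
    return 100.0 * len(set(a) & set(b)) / len(set(a) | set(b))


def word_overlap(a, b):
    """Same formula, but on whitespace-separated words instead of characters."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return 100.0 * len(words_a & words_b) / len(union) if union else 0.0


clean = "ground black pepper"
# Hypothetical dirty value: the second space is a non-breaking space (U+00A0).
odd = "ground black\u00a0pepper"

print(char_overlap(clean, clean))  # identical strings: 100.0
print(char_overlap(clean, odd))    # 14 shared of 15 distinct characters, about 93.33
print(word_overlap(clean, odd))    # split() treats U+00A0 as whitespace: 100.0
```

On a DataFrame the same idea would be a list comprehension over `zip(df.textcol1, df.textcol2)` that calls `word_overlap(a, b)` in place of the character-level expression.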