How to get the intersection of two text columns in a pandas DataFrame

Posted 2025-02-13 05:28:13


I have a df that looks like this:

id  textcol1             textcol2             ... coln
1   blue bowl            green bowl           ... xxx
2   purple sheet         green grass          ... xxx
3   ground black pepper  ground black pepper  ... xxx

and so on...

I want to get the percentage of common words between textcol1 and textcol2:

id  textcol1             textcol2             ... coln intersection
1   blue bowl            green bowl           ... xxx  50
2   purple sheet         green grass          ... xxx  0
3   ground black pepper  ground black pepper  ... xxx  100

After an embarrassingly long time, I came up with the following solution:

df['intersection'] = [(len(set(a) & set(b)) / float(len(set(a) | set(b))) * 100) for a, b in zip(df.textcol1, df.textcol2)]

But the results are not what I would expect, for example passing "ground black pepper" twice yields 93.33333333333330.

I've gone through all the usual cleaning steps - removing whitespace, etc. - but can't figure out what the issue is here.

What am I missing?


木槿暧夏七纪年 2025-02-20 05:28:13

Consider writing a generic text comparison function first, something like text_diff below that computes a simple token overlap between two texts (aka sets of tokens):

def text_diff(text1, text2):
    return 100 * len(text1.intersection(text2)) / min(map(len, (text1, text2)))

Then you can get the two columns you want to compare and turn them into sets of tokens, e.g.,

df2 = df.filter(like="textcol").applymap(str.split).applymap(set)

Result:

                   textcol1                 textcol2
id                                                  
1              {bowl, blue}            {bowl, green}
2           {sheet, purple}           {grass, green}
3   {pepper, black, ground}  {pepper, black, ground}
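The split into word tokens is the step that matters here. As a minimal sketch using only the standard library (the helper name `jaccard_percent` is made up for this illustration), this is what the intersection-over-union ratio from the question's list comprehension gives when `set()` is applied to the raw string rather than to its words. `set()` over a string yields its characters, so the ratio measures shared letters instead of shared words:

```python
# Character-level vs word-level sets for row 1 of the example data.
a, b = "blue bowl", "green bowl"


def jaccard_percent(x, y):
    """Shared elements as a percentage of all distinct elements (hypothetical helper)."""
    x, y = set(x), set(y)
    union = x | y
    return 100 * len(x & y) / len(union) if union else 0.0


# set() over a str iterates its characters, including the space.
char_level = jaccard_percent(a, b)

# Splitting first makes whole words the set elements.
word_level = jaccard_percent(a.split(), b.split())

print(char_level)            # 60.0
print(round(word_level, 2))  # 33.33
```

With character sets, any stray difference between two cells (an extra character, different casing) also shifts the result, which may be why two cells that look identical do not score exactly 100.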

So you can easily apply the function by doing

>>> df2.apply(lambda row: text_diff(*row), axis=1)
id
1     50.0
2      0.0
3    100.0
dtype: float64

That way you can easily tweak and/or replace your text_diff function. Do some research on text similarity measures, too, and use existing tools if applicable. fuzzywuzzy could be worth a shot, too.
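As a rough illustration of swapping the measure, here is a sketch with two interchangeable functions. `overlap_percent` mirrors `text_diff` (shared tokens over the smaller set, often called the overlap coefficient) with an added guard for empty input, and `jaccard_percent` uses the intersection-over-union ratio from the question. The function names, the `compare` helper, and the `rows` list are assumptions made for this example, and plain tuples stand in for the DataFrame columns:

```python
def overlap_percent(tokens1, tokens2):
    """Shared tokens over the smaller set, as in text_diff."""
    smaller = min(len(tokens1), len(tokens2))
    if smaller == 0:
        # Guard added for this sketch: an empty cell would divide by zero.
        return 0.0
    return 100 * len(tokens1 & tokens2) / smaller


def jaccard_percent(tokens1, tokens2):
    """Shared tokens over all distinct tokens of both texts."""
    union = tokens1 | tokens2
    if not union:
        return 0.0
    return 100 * len(tokens1 & tokens2) / len(union)


def compare(rows, measure):
    """Apply a measure to the word sets of each (text1, text2) pair."""
    return [measure(set(a.split()), set(b.split())) for a, b in rows]


# Stand-in for zip(df.textcol1, df.textcol2).
rows = [
    ("blue bowl", "green bowl"),
    ("purple sheet", "green grass"),
    ("ground black pepper", "ground black pepper"),
]

overlap_scores = compare(rows, overlap_percent)
jaccard_scores = [round(s, 2) for s in compare(rows, jaccard_percent)]

print(overlap_scores)  # [50.0, 0.0, 100.0]
print(jaccard_scores)  # [33.33, 0.0, 100.0]
```

The two measures agree on rows 2 and 3 but not on row 1, so which one counts as "the percentage of common words" is a definition to settle first. Separately, on pandas 2.1 or later `DataFrame.applymap` is deprecated in favor of `DataFrame.map`, so the tokenization line above may emit a warning there.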


你是我的挚爱i 2025-02-20 05:28:13

I think the other answers are good, but you want the percentage of common words between textcol1 and textcol2 in each row.

To obtain this, we collect all distinct tokens from both columns of a row and count how many of them occur in both textcol1 and textcol2.

For the first row, the share of common words must be 0.33, because we compare against the set words = {bowl, blue, green}.
textcol1 and textcol2 have only one word in common, common_words = {bowl}.

As a result we get: #common_words / #all_words = 1 / 3 = 0.33
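In set notation, with A and B standing for the word sets of textcol1 and textcol2 (symbols introduced here for illustration), this ratio is what is commonly called the Jaccard index. Worked out for the first row:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
A = \{\text{blue}, \text{bowl}\},\quad B = \{\text{green}, \text{bowl}\}
\;\Longrightarrow\;
J(A, B) = \frac{|\{\text{bowl}\}|}{|\{\text{blue}, \text{bowl}, \text{green}\}|} = \frac{1}{3} \approx 0.33
```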

An example:

def fun(text1, text2):
    # Split on whitespace and deduplicate the words of each text.
    text1_set = set(text1.split())
    text2_set = set(text2.split())

    common_tokens = text1_set & text2_set
    all_tokens = text1_set | text2_set

    # Avoid dividing by zero when both texts are empty.
    if not all_tokens:
        return 0.0
    # Always return a float so the column has a single dtype.
    return round(len(common_tokens) / len(all_tokens), 2)


# Column names follow the question (textcol1, textcol2).
df["intersection"] = df.apply(lambda x: fun(x["textcol1"], x["textcol2"]), axis=1)

The output:

0   blue bowl   green bowl  0.33
1   purple sheet    green grass 0.00
2   ground black pepper ground black pepper 1.00


喜爱纠缠 2025-02-20 05:28:13

Here's a quick and dirty way, but it might need to be adjusted based on the text and on how you define an intersection, as not_a_robot points out.



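def intersections(x):
    # Pool the words from both columns of the row into one list.
    combined = x['textcol1'].split(' ') + x['textcol2'].split(' ')
    # Count how often each word occurs in the pooled list.
    total = {i: combined.count(i) for i in combined}
    # Words seen more than once are treated as shared between the columns.
    return sum(v for v in total.values() if v != 1) / len(combined) * 100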

df['intersections'] = df.apply(intersections, axis=1)
print(df)

              textcol1             textcol2  intersections
0            blue bowl           green bowl           50.0
1         purple sheet          green grass            0.0
2  ground black pepper  ground black pepper          100.0
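One case where the definition of "intersection" matters for this approach: a word repeated inside a single column is also counted more than once in the pooled list, so it is treated as shared even when the other column does not contain it. The sketch below (standard library only; `count_based` and `multiset_based` are names made up for this illustration, and it splits on any whitespace rather than on a single space) contrasts the counting logic above with a variant that only counts words matched across the two texts:

```python
from collections import Counter


def count_based(text1, text2):
    """Same counting idea as intersections() above, on plain strings."""
    combined = text1.split() + text2.split()
    if not combined:
        return 0.0
    counts = Counter(combined)
    return 100 * sum(v for v in counts.values() if v != 1) / len(combined)


def multiset_based(text1, text2):
    """Count only words that are matched across the two texts."""
    c1, c2 = Counter(text1.split()), Counter(text2.split())
    total = sum(c1.values()) + sum(c2.values())
    if total == 0:
        return 0.0
    # Counter & Counter keeps, per word, the smaller of the two counts.
    shared = sum((c1 & c2).values())
    # Each matched word accounts for one token in each text.
    return 100 * 2 * shared / total


print(count_based("blue bowl", "green bowl"))            # 50.0
print(multiset_based("blue bowl", "green bowl"))         # 50.0
print(round(count_based("ground ground", "pepper"), 2))  # 66.67
print(multiset_based("ground ground", "pepper"))         # 0.0
```

Both versions agree on the three example rows; they only diverge once a column repeats a word.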
