Resolving conflicts in a Pandas dataframe

Posted on 2025-01-16 19:17:34

I am performing record linkage on a dataframe such as:

ID_1     ID_2    Predicted Link     Probability
   1        0                 1             0.9
   1        1                 1             0.5
   1        2                 0               0
   2        1                 1             0.8
   2        5                 1             0.8
   3        1                 0               0
   3        2                 1             0.5

When my model overpredicts and links the same ID_1 to more than one ID_2 (indicated by a 1 in Predicted Link), I want to resolve the conflicts based on the Probability value. If one predicted link has a higher probability than the others, I want to keep a 1 for that link but set the other Predicted Link values for that ID_1 to 0. If the highest probabilities are tied, I want to set all the Predicted Link values for that ID_1 to 0. If there is only one predicted link, the predicted values should be left as they are.

The resulting dataframe would look like this:

ID_1     ID_2    Predicted Link     Probability
   1        0                 1             0.9
   1        1                 0             0.5
   1        2                 0               0
   2        1                 0             0.8
   2        5                 0             0.8
   3        1                 0               0
   3        2                 1             0.5

I am grouping via pandas.groupby, and tried some variations with numpy.select and numpy.where, but without luck. Any help much appreciated!

梦言归人 2025-01-23 19:17:34

For each ID_1, you want to keep at most one predicted link. Thus, grouping is a good start.

First, let's construct the data:

import pandas as pd
from io import StringIO

csvfile = StringIO(
"""ID_1\tID_2\tPredicted Link\tProbability
1\t0\t1\t0.9
1\t1\t1\t0.5
1\t2\t0\t0
2\t1\t1\t0.8
2\t5\t1\t0.8
3\t1\t0\t0
3\t2\t1\t0.5""")

df = pd.read_csv(csvfile, sep = '\t', engine='python')

We want one group for each value of ID_1, and within each group we look for the row holding the maximum Probability. Let's create a mask for that:


max_proba = df.groupby("ID_1")["Probability"].transform(lambda x : x.eq(x.max()))

max_proba
Out[196]: 
0     True
1    False
2    False
3     True
4     True
5    False
6     True
Name: Probability, dtype: bool
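
As a side note, the same "is this row the group maximum" mask can probably be written without a lambda, by broadcasting the group maximum with transform("max") and comparing. This is only a sketch: the dataframe is rebuilt inline so the snippet runs on its own, and the names group_max and is_group_max are made up for illustration.

```python
import pandas as pd

# Same data as in the question, rebuilt here so the snippet is self-contained.
df = pd.DataFrame({
    "ID_1": [1, 1, 1, 2, 2, 3, 3],
    "ID_2": [0, 1, 2, 1, 5, 1, 2],
    "Predicted Link": [1, 1, 0, 1, 1, 0, 1],
    "Probability": [0.9, 0.5, 0.0, 0.8, 0.8, 0.0, 0.5],
})

# Broadcast each group's maximum back onto its rows...
group_max = df.groupby("ID_1")["Probability"].transform("max")
# ...then flag the rows whose own probability equals that maximum.
is_group_max = df["Probability"].eq(group_max)
```

Using the built-in "max" aggregation avoids calling a Python function once per group, which tends to matter on large frames.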

Given your rules, the result is already correct for ID_1 = 1 (rows 0, 1, 2) and ID_1 = 3 (rows 5, 6), since each has a single maximum, but not for ID_1 = 2 (rows 3 and 4), where two rows tie for the maximum. Let's build a second mask so that a row is kept only if it holds the maximum and that maximum is unique.

More precisely, for each ID_1, a Probability value that is duplicated cannot be a candidate for the maximum. So we build a mask that flags, for each ID_1 value, the rows whose Probability value is not duplicated:

mask_unique = df.groupby(["ID_1", "Probability"])["Probability"].transform(lambda x : len(x) == 1)

mask_unique
Out[284]: 
0     True
1     True
2     True
3    False
4    False
5     True
6     True
Name: Probability, dtype: bool
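
A possible alternative for the uniqueness mask, assuming the same column names, is DataFrame.duplicated with keep=False, which marks every member of a duplicated (ID_1, Probability) pair. The variable name is_unique is hypothetical, and the dataframe is rebuilt inline so the sketch is self-contained.

```python
import pandas as pd

# Same data as in the question, rebuilt here so the snippet is self-contained.
df = pd.DataFrame({
    "ID_1": [1, 1, 1, 2, 2, 3, 3],
    "ID_2": [0, 1, 2, 1, 5, 1, 2],
    "Predicted Link": [1, 1, 0, 1, 1, 0, 1],
    "Probability": [0.9, 0.5, 0.0, 0.8, 0.8, 0.0, 0.5],
})

# keep=False flags all rows of a duplicated (ID_1, Probability) pair,
# so negating it leaves True only where the pair occurs exactly once.
is_unique = ~df.duplicated(subset=["ID_1", "Probability"], keep=False)
```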

Finally, let's combine the two masks:

# max_proba is the group-maximum mask built above; multiplying by 1 turns True/False into 1/0
df.loc[:, "Predicted Link"] = 1 * (max_proba & mask_unique)

df
Out[285]: 
   ID_1  ID_2  Predicted Link  Probability
0     1     0               1          0.9
1     1     1               0          0.5
2     1     2               0          0.0
3     2     1               0          0.8
4     2     5               0          0.8
5     3     1               0          0.0
6     3     2               1          0.5
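
One thing to keep in mind is that the combined mask above is computed from Probability alone, so it overwrites Predicted Link regardless of its original value. If only rows already predicted as links should compete (an assumption about the intent, not something stated in the answer), a variation could look like the sketch below. The function resolve_conflicts and its parameter names are hypothetical.

```python
import pandas as pd


def resolve_conflicts(df, group_col="ID_1", link_col="Predicted Link",
                      prob_col="Probability"):
    """Return a copy in which each group keeps at most one predicted link."""
    out = df.copy()
    linked = out[link_col].eq(1)
    # Only rows predicted as links compete; the others become NaN.
    candidate_prob = out[prob_col].where(linked)
    # Highest candidate probability per group (NaN if the group has no link).
    best = candidate_prob.groupby(out[group_col]).transform("max")
    is_best = linked & candidate_prob.eq(best)
    # Number of candidate rows sharing the best probability in each group.
    n_best = is_best.astype(int).groupby(out[group_col]).transform("sum")
    # Keep the link only when the best probability is held by a single row.
    out[link_col] = (is_best & n_best.eq(1)).astype(int)
    return out


df = pd.DataFrame({
    "ID_1": [1, 1, 1, 2, 2, 3, 3],
    "ID_2": [0, 1, 2, 1, 5, 1, 2],
    "Predicted Link": [1, 1, 0, 1, 1, 0, 1],
    "Probability": [0.9, 0.5, 0.0, 0.8, 0.8, 0.0, 0.5],
})
resolved = resolve_conflicts(df)
```

On the example data this gives the same Predicted Link column as the two-mask approach. The two would differ only if a row with Predicted Link 0 carried the highest probability in its group.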
