Resolving conflicts in a Pandas dataframe
I am performing record linkage on a dataframe such as:
ID_1  ID_2  Predicted Link  Probability
1     0     1               0.9
1     1     1               0.5
1     2     0               0
2     1     1               0.8
2     5     1               0.8
3     1     0               0
3     2     1               0.5
When my model overpredicts and links the same ID_1 to more than one ID_2 (indicated by a 1 in Predicted Link), I want to resolve the conflicts based on the Probability value. If one predicted link has a higher probability than the others, I want to keep a 1 for that link but set the other Predicted Link values for that ID_1 to 0. If the highest probabilities are equal, I want to set all the Predicted Link values for that ID_1 to 0. If there is only one predicted link, the predicted values should be left as they are.
The resulting dataframe would look like this:
ID_1  ID_2  Predicted Link  Probability
1     0     1               0.9
1     1     0               0.5
1     2     0               0
2     1     0               0.8
2     5     0               0.8
3     1     0               0
3     2     1               0.5
I am grouping via pandas.groupby, and tried some variations with numpy.select and numpy.where, but without luck. Any help much appreciated!
Comments (1)
For each ID_1, you want to keep one and only one row. Thus, grouping is a good start.
First, let's construct our data.
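A minimal sketch of that data, assuming the column names exactly as they appear in the question's table (the variable name `df` is an illustrative choice):

```python
import pandas as pd

# Sample data copied from the question's first table.
df = pd.DataFrame({
    "ID_1": [1, 1, 1, 2, 2, 3, 3],
    "ID_2": [0, 1, 2, 1, 5, 1, 2],
    "Predicted Link": [1, 1, 0, 1, 1, 0, 1],
    "Probability": [0.9, 0.5, 0.0, 0.8, 0.8, 0.0, 0.5],
})

print(df)
```

The default integer index (0 to 6) is what the row numbers in the rest of this answer refer to.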
We want one group for each value of ID_1, and within each group we look for the row holding the max value of Probability. Let's create a mask for that.
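One way to sketch this mask is with `groupby(...).transform("max")`, which broadcasts each group's maximum back onto its rows. The names `group_max` and `is_group_max` are illustrative, and the data is rebuilt here so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "ID_1": [1, 1, 1, 2, 2, 3, 3],
    "ID_2": [0, 1, 2, 1, 5, 1, 2],
    "Predicted Link": [1, 1, 0, 1, 1, 0, 1],
    "Probability": [0.9, 0.5, 0.0, 0.8, 0.8, 0.0, 0.5],
})

# Highest Probability within each ID_1 group, repeated on every row of the group.
group_max = df.groupby("ID_1")["Probability"].transform("max")

# True where the row's Probability equals the maximum of its group.
is_group_max = df["Probability"].eq(group_max)

print(is_group_max.tolist())
# [True, False, False, True, True, False, True]
```

Rows 3 and 4 are both True at this stage because they tie for the maximum of ID_1 = 2.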
Considering your rules, rows 0, 1, 2 and rows 5, 6 are valid (only one max for that ID_1 value), but rows 3 and 4 are not. Let's build a mask that takes both conditions into account: True if the row holds the max value and that max occurs only once. To be more precise, for each ID_1, a Probability value that is duplicated cannot be a candidate for the max. We will therefore compute, for each ID_1 value, a max that excludes duplicated Probability values.
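A possible sketch of this second mask, assuming `DataFrame.duplicated(keep=False)` is used to flag every repeated (ID_1, Probability) pair; this is one way to do it, not necessarily the original code, and the names `is_dup`, `unique_max` and `is_unique_max` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "ID_1": [1, 1, 1, 2, 2, 3, 3],
    "ID_2": [0, 1, 2, 1, 5, 1, 2],
    "Predicted Link": [1, 1, 0, 1, 1, 0, 1],
    "Probability": [0.9, 0.5, 0.0, 0.8, 0.8, 0.0, 0.5],
})

# Flag every row whose Probability appears more than once within its ID_1 group.
is_dup = df.duplicated(subset=["ID_1", "Probability"], keep=False)

# Blank out duplicated values (NaN), then take the per-group max of what is left.
# A group made only of duplicates ends up with NaN as its max.
unique_max = df["Probability"].where(~is_dup).groupby(df["ID_1"]).transform("max")

# True where the row holds the max among non-duplicated values; NaN never matches.
is_unique_max = df["Probability"].eq(unique_max)

print(is_unique_max.tolist())
# [True, False, False, False, False, False, True]
```

For ID_1 = 2 both probabilities are 0.8, so nothing survives the filter and neither row is marked.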
Finally, let's combine our two masks.
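Putting the steps together, a minimal end-to-end sketch could look like the following. The last step, writing the combined mask back into Predicted Link with `Series.where`, is an assumption about how the mask is meant to be applied. It also relies on unlinked rows having a lower Probability than linked rows in the same group, as in the sample data. All variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "ID_1": [1, 1, 1, 2, 2, 3, 3],
    "ID_2": [0, 1, 2, 1, 5, 1, 2],
    "Predicted Link": [1, 1, 0, 1, 1, 0, 1],
    "Probability": [0.9, 0.5, 0.0, 0.8, 0.8, 0.0, 0.5],
})

# Mask 1: the row holds the highest Probability of its ID_1 group.
group_max = df.groupby("ID_1")["Probability"].transform("max")
is_group_max = df["Probability"].eq(group_max)

# Mask 2: the row holds the highest Probability among non-duplicated values.
is_dup = df.duplicated(subset=["ID_1", "Probability"], keep=False)
unique_max = df["Probability"].where(~is_dup).groupby(df["ID_1"]).transform("max")
is_unique_max = df["Probability"].eq(unique_max)

# A row keeps its link only if it is the single, untied maximum of its group.
keep = is_group_max & is_unique_max

# Assumed final step: leave Predicted Link as-is on kept rows, set it to 0 elsewhere.
df["Predicted Link"] = df["Predicted Link"].where(keep, 0)

print(df)
```

On the sample data this leaves Predicted Link at 1 only for rows 0 and 6, which matches the expected result in the question.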