Replace NaN values in the first row of each groupby group with the next row's value if it contains specific values - Python
I have a DataFrame which looks like the following
email month level
[email protected] jan EE2
[email protected] jan nan
[email protected] mar MG1
[email protected] jan nan
[email protected] jan nan
[email protected] jun EE3
[email protected] jan nan
[email protected] apr PT
[email protected] jul MG1
[email protected] aug MG1
[email protected] sep MG2
[email protected] sep MG3
I plan to do a groupby to select the first and last rows for each group.
But before I do that, I'd like to replace the level in each employee's first row, when it is nan, with the level from the next row, but only if that next value contains either "EE" or "MG".
I was thinking of creating a new column called level_new:
email month level level_new
[email protected] jan EE2 EE2
[email protected] jan nan EE3
[email protected] mar MG1 MG1
[email protected] jan nan MG1
[email protected] jan nan nan
[email protected] jun EE3 EE3
[email protected] jan nan MG1
[email protected] apr PT PT
[email protected] jul MG1 MG1
[email protected] aug MG1 MG1
[email protected] oct MG2 MG2
[email protected] sep MG3 MG3
so that I could achieve the following groupby
email month level level_new
[email protected] jan EE2 EE2
[email protected] mar MG1 MG1
[email protected] jan nan EE3
[email protected] sep MG3 MG3
[email protected] jan nan MG1
[email protected] oct MG2 MG2
[email protected] jan nan nan
[email protected] jan nan MG1
[email protected] jul MG1 MG1
So far, I am only able to select the first and the last rows based on the group by, but this would still select the nan values for the first row of each employee.
# get the first and last row of each group
# .nth(0) retrieves the first row
# .nth(-1) retrieves the last row
df2 = df.groupby('email', as_index=False).nth([0, -1])
We could use where to replace values other than "MG" or "EE" with NaN, then groupby + bfill + fillna to fill in NaN values in the "level" column with the next value that is either "MG" or "EE" for each "email". Then use groupby + apply with a lambda that fetches the index of the first and last value for each "email" as a list, + explode the list + drop_duplicates (in case some emails appear only once), to create a mask that returns True for the first and last values for each "email" and False otherwise. Then use this mask to filter the relevant rows.
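The steps described in this answer could be sketched roughly as follows. This is not the answerer's original code. The e-mail addresses and the reduced sample rows are hypothetical stand-ins (the real addresses are masked in the question), and the variable names valid, next_valid, first_last, keep and mask are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real addresses are masked in the question.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com",
              "c@example.com", "d@example.com", "b@example.com",
              "c@example.com", "c@example.com", "b@example.com"],
    "month": ["jan", "jan", "mar", "jan", "jan", "jun", "apr", "aug", "sep"],
    "level": ["EE2", np.nan, "MG1", np.nan, np.nan, "EE3", "PT", "MG1", "MG3"],
})

# 1. Keep only levels containing "EE" or "MG"; everything else becomes NaN.
valid = df["level"].where(df["level"].str.contains("EE|MG", na=False))

# 2. Within each email, back-fill so every row sees the next EE/MG level.
next_valid = valid.groupby(df["email"]).bfill()

# 3. Fill only the missing levels; existing values such as "PT" are kept.
df["level_new"] = df["level"].fillna(next_valid)

# 4. Collect the first and last index label of every email group as a list,
#    flatten the lists, and drop the repeat produced by single-row groups.
first_last = df.groupby("email")["level"].apply(
    lambda s: [s.index[0], s.index[-1]]
)
keep = first_last.explode().drop_duplicates().tolist()

# 5. Boolean mask: True for the first and last row of each email.
mask = df.index.isin(keep)
result = df[mask]
print(result)
```

Note that bfill fills every missing level inside a group, not just the first row. Because only the first and last rows survive the mask, and a missing last row has nothing after it to fill from, the visible effect is limited to the first rows.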
Define the following function to process a group:
Then group your DataFrame by email and apply this function:
For your data sample the result is:
As you can see:
a single row,
== 'PT'.
You don't even need to create any additional column.
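The function this answer refers to is not preserved here, so the following is only a guess at what such an approach might look like, not the answerer's code. It assumes "next row" means the immediately following row of the same email, so a first-row NaN followed by "PT" stays NaN. The function name first_last_rows, the e-mail addresses and the sample rows are all hypothetical. The groups are iterated and concatenated here as one possible way of applying the function per email.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real addresses are masked in the question.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com",
              "c@example.com", "d@example.com", "b@example.com",
              "c@example.com", "c@example.com", "b@example.com"],
    "month": ["jan", "jan", "mar", "jan", "jan", "jun", "apr", "aug", "sep"],
    "level": ["EE2", np.nan, "MG1", np.nan, np.nan, "EE3", "PT", "MG1", "MG3"],
})


def first_last_rows(grp):
    """Return the first and last row of one email group (hypothetical helper).

    If the first level is NaN and the immediately following level contains
    "EE" or "MG", the first level is overwritten with that value.
    """
    grp = grp.copy()  # avoid modifying the caller's DataFrame
    if len(grp) == 1:
        return grp  # a single row is both first and last
    first = grp.index[0]
    nxt = grp["level"].iloc[1]
    if pd.isna(grp.at[first, "level"]) and isinstance(nxt, str) \
            and ("EE" in nxt or "MG" in nxt):
        grp.at[first, "level"] = nxt
    return grp.iloc[[0, -1]]


# Apply the helper to every email group, keeping groups in order of appearance.
result = pd.concat(
    first_last_rows(g) for _, g in df.groupby("email", sort=False)
)
print(result)
```

In this sketch the "level" column is edited directly inside each group copy, which is why no extra column is needed.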