Replace NaN values in the first row of each groupby group with the next row's value if it contains specific values - Python

Posted 2025-01-12 09:31:38

I have a DataFrame which looks like the following

email              month      level
[email protected]    jan        EE2 
[email protected]    jan        nan
[email protected]    mar        MG1 
[email protected]   jan        nan
[email protected]    jan        nan
[email protected]    jun        EE3
[email protected]   jan        nan
[email protected]   apr        PT 
[email protected]   jul        MG1
[email protected]   aug        MG1
[email protected]   sep        MG2 
[email protected]    sep        MG3

I plan to do a groupby to select the first and last rows for each group.

But before I do that, I'd like to replace the "nan" in each employee's first row with the value from the next row, but only if that value contains either "EE" or "MG".

I was thinking of creating a new column called level_new

email              month      level     level_new
[email protected]    jan        EE2       EE2
[email protected]    jan        nan       EE3
[email protected]    mar        MG1       MG1
[email protected]   jan        nan       MG1
[email protected]    jan        nan       nan
[email protected]    jun        EE3       EE3
[email protected]   jan        nan       MG1
[email protected]   apr        PT        PT
[email protected]   jul        MG1       MG1
[email protected]   aug        MG1       MG1
[email protected]   sep        MG2       MG2  
[email protected]    sep        MG3       MG3

so that I could achieve the following groupby

email              month      level     level_new
[email protected]    jan        EE2       EE2
[email protected]    mar        MG1       MG1
[email protected]    jan        nan       EE3
[email protected]    sep        MG3       MG3
[email protected]   jan        nan       MG1
[email protected]   sep        MG2       MG2  
[email protected]    jan        nan       nan
[email protected]   jan        nan       MG1
[email protected]   jul        MG1       MG1

So far, I am only able to select the first and the last rows based on the group by, but this would still select the nan values for the first row of each employee.

# Get the first and last row of each group:
# .nth(0) retrieves the first row
# .nth(-1) retrieves the last row
df2 = df.groupby('email', as_index=False).nth([0, -1])
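
To reproduce the setup, a minimal sketch follows. The addresses in the post are obscured, so the five `userN@example.com` values are placeholders invented here, and the assignment of rows to employees is inferred from the expected output above, not stated in the question.

```python
import numpy as np
import pandas as pd

# Placeholder addresses (assumption): the real ones are obscured in the post.
# Which rows belong to which employee is inferred from the expected output.
a, b, c, d, e = (f"user{i}@example.com" for i in range(1, 6))

df = pd.DataFrame({
    "email": [a, b, a, c, d, b, e, c, e, c, c, b],
    "month": ["jan", "jan", "mar", "jan", "jan", "jun",
              "jan", "apr", "jul", "aug", "sep", "sep"],
    "level": ["EE2", np.nan, "MG1", np.nan, np.nan, "EE3",
              np.nan, "PT", "MG1", "MG1", "MG2", "MG3"],
})

# First and last row of each employee; a single-row group appears once.
df2 = df.groupby("email", as_index=False).nth([0, -1])
print(df2)
```

With this frame, four of the selected rows still have NaN in "level", which is the problem described above.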

满地尘埃落定 2025-01-19 09:31:38

We could use where to replace values other than "MG" or "EE" with NaN; then groupby + bfill to pull the next "MG" or "EE" value backwards within each "email"; and finally fillna so that only the NaN values in the "level" column are filled.

Then use groupby + apply with a lambda that fetches the index of the first and last row for each "email" as a list, explode the lists, and drop_duplicates (in case some emails appear only once). This gives the index labels of the first and last rows of each "email" (assuming the index is in ascending order within each group, as it is here), which we pass to loc to select the relevant rows:

# na=False makes NaN levels count as non-matches instead of propagating NaN into the condition
df['level_new'] = df['level'].fillna(df['level'].where(df['level'].str.contains('MG|EE', na=False)).groupby(df['email']).bfill())
out = df.loc[df.groupby('email')['level_new'].apply(lambda x: [x.index.min(), x.index.max()]).explode().drop_duplicates()]

Output:

               email month level level_new
4    [email protected]   jan   NaN       NaN
6   [email protected]   jan   NaN       MG1
8   [email protected]   jul   MG1       MG1
0    [email protected]   jan   EE2       EE2
2    [email protected]   mar   MG1       MG1
1    [email protected]   jan   NaN       EE3
11   [email protected]   sep   MG3       MG3
3   [email protected]   jan   NaN       MG1
10  [email protected]   sep   MG2       MG2
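
The one-liner above packs three steps together. As a rough sketch, they can be separated on a small toy frame; the addresses and values below are made up for illustration and are not the poster's data. The final selection uses head(1)/tail(1) instead of the index min/max lambda, which is a swapped-in alternative and not what the answer itself does.

```python
import numpy as np
import pandas as pd

# Toy data with made-up addresses (assumption, not the poster's data).
df = pd.DataFrame({
    "email": ["a@example.com"] * 3 + ["b@example.com"] * 2 + ["c@example.com"],
    "level": [np.nan, "PT", "MG1", np.nan, "PT", np.nan],
})

# Step 1: keep only values containing "MG" or "EE"; everything else becomes NaN.
# na=False makes NaN rows count as "no match".
candidates = df["level"].where(df["level"].str.contains("MG|EE", na=False))

# Step 2: within each email, pull the next MG/EE value backwards.
next_match = candidates.groupby(df["email"]).bfill()

# Step 3: fill only the gaps; existing values such as "PT" are left alone.
df["level_new"] = df["level"].fillna(next_match)

# First and last row per email via head/tail (alternative to index min/max).
g = df.groupby("email")
out = pd.concat([g.head(1), g.tail(1)])
out = out[~out.index.duplicated()].sort_index()
print(out)
```

Note that the backfill skips over "PT": the first row of a@example.com receives "MG1" even though the row right after it is "PT", while b@example.com has no "MG"/"EE" value at all and stays NaN.
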
拥抱我好吗 2025-01-19 09:31:38

Define the following function to process a group:

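import pandas as pd

def procGrp(grp):
    if grp.index.size == 1:    # single row only
        return grp
    grp = grp.copy()           # work on a copy, not on the original data
    col = grp.columns.get_loc('level')  # position of "level", not hard-coded
    if pd.isnull(grp.iat[0, col]):
        nxtLev = grp.iat[1, col]  # next "level"
        # nxtLev may itself be NaN (a float), so make sure it is a string
        if isinstance(nxtLev, str) and (('EE' in nxtLev) or ('MG' in nxtLev)):
            grp.iat[0, col] = nxtLev  # set in 1-st row
    # Return first and last row from this group
    return grp.loc[[grp.index[0], grp.index[-1]]]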

Then group your DataFrame by email and apply this function:

result = df.groupby('email', group_keys=False).apply(procGrp)

For your data sample the result is:

               email month level
4    [email protected]   jan   NaN
6   [email protected]   jan   MG1
8   [email protected]   jul   MG1
0    [email protected]   jan   EE2
2    [email protected]   mar   MG1
1    [email protected]   jan   EE3
11   [email protected]   sep   MG3
3   [email protected]   jan   NaN
10  [email protected]   sep   MG2

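As you can see, you don't even need to create any additional column. Note that this function only looks at the row immediately after the first one, which is why the row with index 3 keeps NaN in the result above.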
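
For comparison, the same "next row only" rule can also be sketched without a Python-level function per group, using groupby + shift. This is an alternative formulation, not the answer's method, and the toy addresses and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with made-up addresses (assumption, not the poster's data).
df = pd.DataFrame({
    "email": ["a@example.com"] * 2 + ["b@example.com"] * 3 + ["c@example.com"],
    "level": [np.nan, "EE3", np.nan, "PT", "MG1", np.nan],
})

g = df.groupby("email")["level"]
is_first = g.cumcount() == 0   # True on the first row of each email
nxt = g.shift(-1)              # level of the next row within the same email
ok = nxt.str.contains("EE|MG", na=False)

# Fill only first rows that are NaN and whose immediate successor
# contains "EE" or "MG"; every other row is left untouched.
fill_here = is_first & df["level"].isna() & ok
df.loc[fill_here, "level"] = nxt[fill_here]
print(df)
```

Here the first row of a@example.com becomes "EE3", while b@example.com keeps NaN because its next row is "PT", and the single-row c@example.com is unchanged.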
