Replace a NaN in the first row of each groupby group with the value of the next row that contains specific values - Python

Posted 2025-01-12 09:31:38

I have a DataFrame that looks like the following:

email              month      level
[email protected]    jan        EE2 
[email protected]    jan        nan
[email protected]    mar        MG1 
[email protected]   jan        nan
[email protected]    jan        nan
[email protected]    jun        EE3
[email protected]   jan        nan
[email protected]   apr        PT 
[email protected]   jul        MG1
[email protected]   aug        MG1
[email protected]   sep        MG2 
[email protected]    sep        MG3

I plan to do a groupby to select the first and last rows for each group.

But before I do that, for each employee whose first row has a "nan" level, I'd like to replace that value with the level from the next row, but only if the next row contains either "EE" or "MG".

I was thinking of creating a new column called level_new:

email              month      level     level_new
[email protected]    jan        EE2       EE2
[email protected]    jan        nan       EE3
[email protected]    mar        MG1       MG1
[email protected]   jan        nan       MG1
[email protected]    jan        nan       nan
[email protected]    jun        EE3       EE3
[email protected]   jan        nan       MG1
[email protected]   apr        PT        PT
[email protected]   jul        MG1       MG1
[email protected]   aug        MG1       MG1
[email protected]   oct        MG2       MG2  
[email protected]    sep        MG3       MG3

so that I could get the following groupby result:

email              month      level     level_new
[email protected]    jan        EE2       EE2
[email protected]    mar        MG1       MG1
[email protected]    jan        nan       EE3
[email protected]    sep        MG3       MG3
[email protected]   jan        nan       MG1
[email protected]   oct        MG2       MG2  
[email protected]    jan        nan       nan
[email protected]   jan        nan       MG1
[email protected]   jul        MG1       MG1

So far, I am only able to select the first and last rows of each group, but this still selects the nan values in the first row of each employee.

# get the first and last row of each group
# nth(0) retrieves the first row
# nth(-1) retrieves the last row
df2 = df.groupby('email', as_index=False).nth([0, -1])
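
To make the problem reproducible, here is a minimal sketch of the selection step on a small hypothetical sample. The addresses and rows are placeholders made up for illustration, not the asker's data. It shows that nth([0, -1]) keeps the first and last row of each group, returns a single-row group only once, and passes a NaN first row through untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the addresses are placeholders, not the asker's data.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com",
              "c@example.com", "b@example.com", "b@example.com"],
    "month": ["jan", "jan", "mar", "jan", "apr", "jul"],
    "level": ["EE2", np.nan, "MG1", np.nan, "PT", "MG1"],
})

# Keep the first and last row of every email group.
df2 = df.groupby("email", as_index=False).nth([0, -1])

# Row 4 (the middle row of the "b" group) is dropped, row 3 (a single-row
# group) is returned once, and the NaN in row 1 is still there.
print(df2)
```

So any filling of the first-row NaN has to happen before this selection.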

满地尘埃落定 2025-01-19 09:31:38

We could use where to replace values that contain neither "MG" nor "EE" with NaN; then groupby + bfill to back-fill within each "email", and fillna to fill the NaN values in the "level" column with the next value for that "email" that contains "MG" or "EE".

Then use groupby + apply a lambda that fetches the index labels of the first and last rows for each "email" as a list + explode the list + drop_duplicates (in case some emails appear only once) to collect the labels of the first and last rows of each "email". Then pass these labels to loc to select the relevant rows:

# na=False makes NaN count as "no match" in the str.contains mask
df['level_new'] = df['level'].fillna(df['level'].where(df['level'].str.contains('MG|EE', na=False)).groupby(df['email']).bfill())
# explode() flattens the [first, last] index labels; loc then selects those rows
out = df.loc[df.groupby('email')['level_new'].apply(lambda x: [x.index.min(), x.index.max()]).explode().drop_duplicates()]

Output:

               email month level level_new
4    [email protected]   jan   NaN       NaN
6   [email protected]   jan   NaN       MG1
8   [email protected]   jul   MG1       MG1
0    [email protected]   jan   EE2       EE2
2    [email protected]   mar   MG1       MG1
1    [email protected]   jan   NaN       EE3
11   [email protected]   sep   MG3       MG3
3   [email protected]   jan   NaN       MG1
10  [email protected]   sep   MG2       MG2
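
The fill step is dense as a one-liner, so here is the same idea split into named steps. This is only a sketch on a hypothetical sample (placeholder addresses, and the intermediate names is_target, candidates and next_target are mine, not part of the answer). Note that bfill looks past a non-matching row such as "PT" and picks up the next "MG"/"EE" value further down the group:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the addresses are placeholders, not the asker's data.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com",
              "c@example.com", "b@example.com", "b@example.com"],
    "month": ["jan", "jan", "mar", "jan", "apr", "jul"],
    "level": ["EE2", np.nan, "MG1", np.nan, "PT", "MG1"],
})

# 1. Mark levels that contain "MG" or "EE"; NaN counts as no match.
is_target = df["level"].str.contains("MG|EE", na=False)

# 2. Blank out everything else, so only "MG"/"EE" values can be propagated.
candidates = df["level"].where(is_target)

# 3. Within each email, pull the next "MG"/"EE" value backwards.
next_target = candidates.groupby(df["email"]).bfill()

# 4. Fill only the original NaN cells; existing values such as "PT" stay.
df["level_new"] = df["level"].fillna(next_target)

print(df)
```

Row 1 ends up with "MG1" even though the row right after it in its group is "PT", and row 3 stays NaN because its group has nothing to fill from.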

拥抱我好吗 2025-01-19 09:31:38

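Define the following function to process a group:

import pandas as pd

def procGrp(grp):
    grp = grp.copy()           # do not mutate the group object passed in by apply
    if grp.index.size == 1:    # single row only
        return grp
    lev = grp.columns.get_loc('level')  # position of the "level" column
    if pd.isnull(grp.iat[0, lev]):
        nxtLev = grp.iat[1, lev]  # next "level"
        # the next value may itself be NaN, so check that it is a string first
        if isinstance(nxtLev, str) and ('EE' in nxtLev or 'MG' in nxtLev):
            grp.iat[0, lev] = nxtLev  # set in 1-st row
    # Return first and last row from this group
    return grp.loc[[grp.index[0], grp.index[-1]]]

Then group your DataFrame by email and apply this function (group_keys=False keeps the original row index instead of adding the email as an extra index level):

result = df.groupby('email', group_keys=False).apply(procGrp)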

For your data sample the result is:

               email month level
4    [email protected]   jan   NaN
6   [email protected]   jan   MG1
8   [email protected]   jul   MG1
0    [email protected]   jan   EE2
2    [email protected]   mar   MG1
1    [email protected]   jan   EE3
11   [email protected]   sep   MG3
3   [email protected]   jan   NaN
10  [email protected]   sep   MG2

As you can see:

the row for [email protected] still has NaN, because this group contains only a single row,
the row for [email protected] still has NaN, because the next row has level == 'PT'.

You don't even need to create any additional column.
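
If you would rather avoid apply, the same "next row only" rule can be sketched with vectorized groupby operations. This is an alternative I am suggesting, not the function above, and it runs on a hypothetical sample (placeholder addresses; is_first, is_last, next_level and fill_here are illustrative names). Unlike the function, it writes to a separate level_new column so the original values stay visible:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the addresses are placeholders, not the asker's data.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com", "c@example.com",
              "b@example.com", "b@example.com", "d@example.com", "d@example.com"],
    "month": ["jan", "jan", "mar", "jan", "apr", "jul", "jan", "feb"],
    "level": ["EE2", np.nan, "MG1", np.nan, "PT", "MG1", np.nan, "MG2"],
})

groups = df.groupby("email")
is_first = groups.cumcount() == 0                 # first row of each group
is_last = groups.cumcount(ascending=False) == 0   # last row of each group

# Level of the immediately following row within the same group (NaN if none).
next_level = groups["level"].shift(-1)

# Fill only a NaN first row, and only when the very next row has "MG" or "EE".
fill_here = is_first & df["level"].isna() & next_level.str.contains("MG|EE", na=False)
df["level_new"] = df["level"].mask(fill_here, next_level)

# First and last row per group; a single-row group is selected once.
result = df[is_first | is_last]
print(result)
```

Here row 6 is filled with "MG2", while row 1 stays NaN because the next row in its group is "PT", and row 3 stays NaN because its group has a single row.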
