Pandas: how to delete selected rows from each group and keep only the most recent row

Posted 2025-01-14 05:22:08

I have the following dataframe:

import pandas as pd

df1 = pd.DataFrame({"id": ['A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'B3', 'B4', 'C1', 'C2', 'C3', 'C4'],
                "date": [pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2017, 12, 30), pd.Timestamp(2018, 12, 30),
                         pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2017, 12, 30), pd.Timestamp(2018, 12, 30),
                         pd.Timestamp(2016, 12, 30), pd.Timestamp(2017, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2019, 12, 30)],
                "other_col": ['NA', 'NA', 'A333', 'A444', 'NA', 'NA', 'B555', 'B666', 'NA', 'C777', 'C888', 'C999'],
                "other_col_1": [123, 123, 'NA', 'NA', 0.765, 0.555, 'NA', 'NA', 0.324, 'NA', 'NA', 'NA']})

Within each group (the letter prefix of "id"), more than one row can have a value in "other_col". I want to delete all of those rows except the most recent one per group, while keeping the rows where "other_col" is 'NA'. The resulting dataframe should be:

df_new = pd.DataFrame({"id": ['A1', 'A2', 'A4', 'B1', 'B2', 'B4', 'C1','C4'  ], 
                "date": [pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2018, 12, 30),pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2019, 12, 30)], 
                "other_col": ['NA', 'NA', 'A444', 'NA', 'NA', 'B666', 'NA', 'C999'],
                "other_col_1": [123, 123, 'NA', 0.765, 0.555, 'NA', 0.324, 'NA']})

离去的眼神 2025-01-21 05:22:08

First convert the 'NA' strings in other_col to missing values and, if necessary, sort the rows by id and date. GroupBy.last then returns the last non-missing value of other_col for each group, where the groups are built from id with the digits removed. Finally, keep the rows whose other_col equals that value or is missing:

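import numpy as np

# turn the 'NA' placeholder strings into real missing values
df1['other_col'] = df1['other_col'].replace('NA', np.nan)
df1 = df1.sort_values(['id','date'])

# group key: id without digits; regex=True is required in pandas >= 2.0
s = df1.groupby(df1['id'].str.replace(r'\d+', '', regex=True))['other_col'].transform('last')
df_new = df1[df1['other_col'].eq(s) | df1['other_col'].isna()]
print(df_new)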
    id       date other_col other_col_1
0   A1 2015-12-30       NaN         123
1   A2 2016-12-30       NaN         123
3   A4 2018-12-30      A444          NA
4   B1 2015-12-30       NaN       0.765
5   B2 2016-12-30       NaN       0.555
7   B4 2018-12-30      B666          NA
8   C1 2016-12-30       NaN       0.324
11  C4 2019-12-30      C999          NA
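
For comparison, here is a minimal self-contained sketch of a variant that does not depend on the row order. This is not the method from the answer above: it swaps in idxmax on the date column to locate the most recent valued row per group. It assumes the group is the id with its digits stripped and that 'NA' is still a plain string; the names has_value and latest_idx are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": ['A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'B3', 'B4', 'C1', 'C2', 'C3', 'C4'],
    "date": pd.to_datetime(['2015-12-30', '2016-12-30', '2017-12-30', '2018-12-30',
                            '2015-12-30', '2016-12-30', '2017-12-30', '2018-12-30',
                            '2016-12-30', '2017-12-30', '2018-12-30', '2019-12-30']),
    "other_col": ['NA', 'NA', 'A333', 'A444', 'NA', 'NA', 'B555', 'B666',
                  'NA', 'C777', 'C888', 'C999'],
})

# Assumption: the group is the id with its digits removed ('A1' -> 'A').
group = df1['id'].str.replace(r'\d+', '', regex=True)

# Rows that carry a real value in other_col (here 'NA' is a plain string).
has_value = df1['other_col'].ne('NA')

# Among the valued rows, get the index label of the latest date per group.
latest_idx = df1[has_value].groupby(group[has_value])['date'].idxmax()

# Keep every 'NA' row plus the single most recent valued row of each group.
df_new = df1[~has_value | df1.index.isin(latest_idx)]
print(df_new['id'].tolist())
# ['A1', 'A2', 'A4', 'B1', 'B2', 'B4', 'C1', 'C4']
```

Because the latest row is found by date rather than by position, no prior sort_values call is needed.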

憧憬巴黎街头的黎明 2025-01-21 05:22:08

IIUC, you can group by the letter and the NA status and take the last row of each group:

df2 = df1.groupby([df1['id'].str[0], df1['other_col'].eq('NA')],
                  sort=False, as_index=False).last()

output:

   id       date other_col
0  A1 2016-12-30        NA
1  A3 2018-12-30       444
2  B1 2016-12-30        NA
3  B3 2018-12-30       222
4  C1 2016-12-30        NA
5  C4 2019-12-30       888

For a more generic way to get the group letter from the id: df1['id'].str.extract(r'^(\D)', expand=False)

If you have real NaNs in other_col, use df1['other_col'].isna()
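
The two remarks above can be illustrated with a small sketch. The sample ids and values below are hypothetical, and the pattern `^(\D+)` (with a `+`) is a variation on the one in the answer, used here to show how a multi-letter prefix could be captured.

```python
import numpy as np
import pandas as pd

# Hypothetical ids, including a two-letter prefix and a multi-digit suffix.
ids = pd.Series(['A1', 'A12', 'AB3', 'B7'])

first_char = ids.str[0]                            # always exactly one character
prefix = ids.str.extract(r'^(\D+)', expand=False)  # all leading non-digits

print(first_char.tolist())  # ['A', 'A', 'A', 'B']
print(prefix.tolist())      # ['A', 'A', 'AB', 'B']

# The string 'NA' and a real NaN need different tests.
col = pd.Series(['NA', np.nan, 'A444'])
print(col.eq('NA').tolist())  # [True, False, False]
print(col.isna().tolist())    # [False, True, False]
```

With single-letter prefixes, as in the question, both ways of building the group key give the same result.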
