Pandas: how to delete selected rows from each group and keep only the most recent row
I have the following dataframe:
import pandas as pd

df1 = pd.DataFrame({"id": ['A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'B3', 'B4', 'C1','C2','C3','C4' ],
"date": [pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2017, 12, 30), pd.Timestamp(2018, 12, 30),pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2017, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2017, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2019, 12, 30)],
"other_col": ['NA', 'NA', 'A333', 'A444', 'NA', 'NA', 'B555', 'B666', 'NA', 'C777', 'C888', 'C999'],
"other_col_1": [123, 123, 'NA', 'NA', 0.765, 0.555, 'NA', 'NA', 0.324, 'NA', 'NA','NA']})
The ids form groups by their letter (A, B, C). Within each group, when more than one row has a value in "other_col", I want to delete all of those rows except the most recent one. Rows where "other_col" is 'NA' should be kept. The resulting dataframe should be:
df_new = pd.DataFrame({"id": ['A1', 'A2', 'A4', 'B1', 'B2', 'B4', 'C1','C4' ],
"date": [pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2018, 12, 30),pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2019, 12, 30)],
"other_col": ['NA', 'NA', 'A444', 'NA', 'NA', 'B666', 'NA', 'C999'],
"other_col_1": [123, 123, 'NA', 0.765, 0.555, 'NA', 0.324, 'NA']})
First convert the NA values in other_col to missing values and, if necessary, sort the rows by id and date. GroupBy.last then returns the last non-missing other_col value for each group, where the groups are built from id with the numbers removed. The final filter keeps the rows that match that last value, together with the rows where other_col is missing.
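The code for this answer did not survive on this page, so the following is only a minimal sketch of the steps described above, not the answerer's original code. The names `tmp`, `key`, `last_val` and `result` are made up for illustration, `Series.mask` is used here as one way to turn the 'NA' strings into missing values, and the frame is rebuilt with `pd.to_datetime` only to keep the example short.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": ['A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'B3', 'B4', 'C1', 'C2', 'C3', 'C4'],
    "date": pd.to_datetime(['2015-12-30', '2016-12-30', '2017-12-30', '2018-12-30',
                            '2015-12-30', '2016-12-30', '2017-12-30', '2018-12-30',
                            '2016-12-30', '2017-12-30', '2018-12-30', '2019-12-30']),
    "other_col": ['NA', 'NA', 'A333', 'A444', 'NA', 'NA', 'B555', 'B666',
                  'NA', 'C777', 'C888', 'C999'],
    "other_col_1": [123, 123, 'NA', 'NA', 0.765, 0.555, 'NA', 'NA',
                    0.324, 'NA', 'NA', 'NA'],
})

# Step 1: replace the 'NA' strings in other_col with real missing values.
tmp = df1.assign(other_col=df1['other_col'].mask(df1['other_col'].eq('NA')))

# Step 2: sort by id and date so that "last" means "most recent" in each group.
tmp = tmp.sort_values(['id', 'date'])

# Step 3: group key is the id with its digits removed ('A1' -> 'A').
key = tmp['id'].str.replace(r'\d+', '', regex=True)

# Step 4: broadcast the last non-missing other_col value of each group to its rows.
last_val = tmp.groupby(key)['other_col'].transform('last')

# Step 5: keep rows with a missing other_col, plus the row holding the last value.
result = tmp[tmp['other_col'].isna() | tmp['other_col'].eq(last_val)]
print(result)
```

Note that in `result` the 'NA' strings of other_col are real missing values; other_col_1 is left untouched.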
IIUC, you can groupby the letter and the NA status and get the last row of each group.
For a more generic way to get the letter from the id:
df1['id'].str.extract(r'^(\D)', expand=False)
If you have real NaNs in other_col, use
df1['other_col'].isna()
for the NA status.
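The code and output of this answer are also missing from the page. The sketch below is one possible reading of the groupby idea that reproduces the frame expected in the question. It is an assumption, not the answerer's code: it takes the last row of each (letter, NA status) group and then keeps every 'NA' row as well, because the expected result retains all of them. The names `ordered`, `letter`, `is_na`, `last_idx` and `result` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": ['A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'B3', 'B4', 'C1', 'C2', 'C3', 'C4'],
    "date": pd.to_datetime(['2015-12-30', '2016-12-30', '2017-12-30', '2018-12-30',
                            '2015-12-30', '2016-12-30', '2017-12-30', '2018-12-30',
                            '2016-12-30', '2017-12-30', '2018-12-30', '2019-12-30']),
    "other_col": ['NA', 'NA', 'A333', 'A444', 'NA', 'NA', 'B555', 'B666',
                  'NA', 'C777', 'C888', 'C999'],
    "other_col_1": [123, 123, 'NA', 'NA', 0.765, 0.555, 'NA', 'NA',
                    0.324, 'NA', 'NA', 'NA'],
})

# Sort by date so the last row of a group is its most recent one.
ordered = df1.sort_values('date', kind='stable')

# Group keys: leading non-digit character of id, and whether other_col is 'NA'.
letter = ordered['id'].str.extract(r'^(\D)', expand=False)
is_na = ordered['other_col'].eq('NA')  # with real NaNs: ordered['other_col'].isna()

# Index labels of the last row in every (letter, NA status) group.
last_idx = ordered.groupby([letter, is_na]).tail(1).index

# Keep all 'NA' rows plus the most recent non-'NA' row of each letter.
result = df1[df1['other_col'].eq('NA') | df1.index.isin(last_idx)]
print(result)
```

Because the final selection is made on `df1` itself, the rows come back in their original order and the 'NA' strings are left as they are.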