Backfill a pandas column based on a group

Posted on 2025-01-14 17:33:27, 1096 characters, 3 views, 0 comments

I am working with the following dataframe:

import pandas as pd

df = pd.DataFrame({"id": ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
            "date": [pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2018, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2019, 12, 30)],
            "other_col": ['NA', 'NA', 'A444', 'NA', 'NA', 'B666', 'NA', 'C999'],
            "other_col_1": [123, 123, 'NA', 0.765, 0.555, 'NA', 0.324, 'NA']})

What I want to achieve: backfill the "other_col" entries within each "id" group, and then delete the rows where "other_col_1" is equal to 'NA'.

I have tried groupby with bfill() and ffill(), e.g. df.groupby('id')['other_col'].bfill(), but it doesn't work.
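A likely reason the grouped bfill() appears to do nothing, shown as a minimal sketch on a trimmed copy of the frame above: the 'NA' entries are ordinary strings, so pandas does not count them as missing. The variable names below are illustrative only and are not part of the original post.

```python
import pandas as pd

# Trimmed copy of the question's frame (only the columns needed here)
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "other_col": ["NA", "NA", "A444", "NA", "NA", "B666", "NA", "C999"],
})

# The string 'NA' is a regular value, so no entry counts as missing
n_missing_before = int(df["other_col"].isna().sum())

# With nothing missing, a grouped bfill returns the same values
raw_filled = df.groupby("id")["other_col"].bfill()
unchanged = raw_filled.tolist() == df["other_col"].tolist()

# After converting the placeholder to a real missing value, bfill has work to do
cleaned = df.replace("NA", pd.NA)
n_missing_after = int(cleaned["other_col"].isna().sum())
filled = cleaned.groupby("id")["other_col"].bfill()

print(n_missing_before, unchanged, n_missing_after)
print(filled.tolist())
```

If this assumption holds for the real data, converting the placeholder strings first is the step that makes bfill() and ffill() behave as expected.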

The resulting dataframe should look like this:

df_new = pd.DataFrame({"id": ['A', 'A', 'B', 'B', 'C' ], 
            "date": [pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2015, 12, 30), pd.Timestamp(2016, 12, 30), pd.Timestamp(2016, 12, 30)], 
            "other_col": ['A444', 'A444', 'B666', 'B666', 'C999'],
            "other_col_1": [123, 123, 0.765, 0.555, 0.324]})

Comments (2)

拥抱我好吗 2025-01-21 17:33:27

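First, replace 'NA' with a real NaN value, then bfill. Note that this bfill is not grouped by "id"; it matches the expected result here because the last row of every group holds the non-missing "other_col" value:

import numpy as np
df = df.replace('NA', np.nan)
# the row mask is computed on the frame before bfill, so rows originally missing other_col_1 are dropped
df = df.bfill()[df['other_col_1'].notna()]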

Output:

>>> df
  id       date other_col  other_col_1
0  A 2015-12-30      A444      123.000
1  A 2016-12-30      A444      123.000
3  B 2015-12-30      B666        0.765
4  B 2016-12-30      B666        0.555
6  C 2016-12-30      C999        0.324

各空 2025-01-21 17:33:27

IIUC, you could do:

out = (
 df.replace('NA', pd.NA) # ensure real NA
   .assign(other_col=lambda d: d['other_col'].bfill()) # backfill other_col
   .dropna(subset=['other_col_1']) # drop rows based on other_col_1
)

or, to bfill per group (the grouping key strips digits from "id", so ids such as A1 and A2 fall into the same group A):

(df.replace('NA', pd.NA)
   .assign(other_col=lambda d: d.groupby(d['id'].str.replace(r'\d+', '', regex=True))
                                ['other_col'].bfill())
   .dropna(subset=['other_col_1'])
)

output:

   id       date other_col other_col_1
0  A1 2015-12-30      A444         123
1  A2 2016-12-30      A444         123
3  B1 2015-12-30      B666       0.765
4  B2 2016-12-30      B666       0.555
6  C1 2016-12-30      C999       0.324
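
The output above uses ids such as A1 and A2, whereas the frame in the question uses plain A, B, C. As a hedged sketch, assuming the question's frame as posted, the per-group variant can group on "id" directly without the digit-stripping step:

```python
import pandas as pd

# The frame from the question, with 'NA' strings as placeholders
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "date": pd.to_datetime([
        "2015-12-30", "2016-12-30", "2018-12-30",
        "2015-12-30", "2016-12-30", "2018-12-30",
        "2016-12-30", "2019-12-30",
    ]),
    "other_col": ["NA", "NA", "A444", "NA", "NA", "B666", "NA", "C999"],
    "other_col_1": [123, 123, "NA", 0.765, 0.555, "NA", 0.324, "NA"],
})

out = (
    df.replace("NA", pd.NA)  # turn the placeholder into a real missing value
      .assign(other_col=lambda d: d.groupby("id")["other_col"].bfill())  # backfill within each id
      .dropna(subset=["other_col_1"])  # drop rows where other_col_1 is missing
)
print(out)
```

Grouping matters when a group's last "other_col" entry is missing: an ungrouped bfill would pull the value from the next group, while the grouped version leaves it missing.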