Pandas: replace column values with the last available value

Posted 2025-01-18 11:16:54

For the sample data frame built by the code below, I want to update the column offset_date so that any date in offset_date that does not appear in the column date is replaced with the last available value in date.

import numpy as np
import pandas as pd

data = {"date": ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05',
                 '2021-01-07', '2021-01-09', '2021-01-10', '2021-01-11'],
        "offset_date": ['2021-01-02', '2021-01-04', '2021-01-05',
                        '2021-01-06', '2021-01-08', '2021-01-10',
                        '2021-01-11', '2021-01-12']}

test_df = pd.DataFrame(data)
# Parse both columns as datetimes so they can be compared
test_df['date'] = pd.to_datetime(test_df['date'])
test_df['offset_date'] = pd.to_datetime(test_df['offset_date'])

To explain further: in the first row of the data frame above, the offset_date 2021-01-02 does not appear in the column date, so I want to replace it with the last available value in date, i.e. 2021-01-01.

I want a vectorized approach, so I tried the following, which led to incorrect results.

test_df['offset_date_upd'] = np.where(test_df['offset_date'] in test_df['date'].values,
                                      test_df['offset_date'], 
                                      test_df[test_df['date'] <= test_df['offset_date']].values.max())

How can I get the below desired output using a vectorized approach?

Desired Output

+------------+-------------+
|    date    | offset_date |
+------------+-------------+
| 2021-01-01 | 2021-01-01  |
| 2021-01-03 | 2021-01-04  |
| 2021-01-04 | 2021-01-05  |
| 2021-01-05 | 2021-01-05  |
| 2021-01-07 | 2021-01-07  |
| 2021-01-09 | 2021-01-10  |
| 2021-01-10 | 2021-01-11  |
| 2021-01-11 | 2021-01-11  |
+------------+-------------+

Comments (3)

墨小墨 2025-01-25 11:16:54

IIUC, you can use isin and where:

test_df['offset_date'] = (test_df['offset_date']
                          .where(test_df['offset_date'].isin(test_df['date']),
                                 test_df['date'])
                         )

output:

        date offset_date
0 2021-01-01  2021-01-01
1 2021-01-03  2021-01-04
2 2021-01-04  2021-01-05
3 2021-01-05  2021-01-05
4 2021-01-07  2021-01-07
5 2021-01-09  2021-01-10
6 2021-01-10  2021-01-11
7 2021-01-11  2021-01-11
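
Note that the where call above falls back to the date in the same row. That matches "last available date" for the sample data because each row's date happens to be the latest date that is not after its offset_date. If that row alignment cannot be assumed, one possible alternative is a lookup with numpy.searchsorted on the sorted dates. This is only a sketch: the helper name last_available is hypothetical, and it assumes every offset_date falls on or after the earliest date.

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({
    "date": pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05',
                            '2021-01-07', '2021-01-09', '2021-01-10', '2021-01-11']),
    "offset_date": pd.to_datetime(['2021-01-02', '2021-01-04', '2021-01-05',
                                   '2021-01-06', '2021-01-08', '2021-01-10',
                                   '2021-01-11', '2021-01-12']),
})


def last_available(dates: pd.Series, targets: pd.Series) -> pd.Series:
    """Hypothetical helper: for each target, the latest value in `dates` that is <= target."""
    sorted_dates = np.sort(dates.to_numpy())
    # side="right" counts the dates <= target; subtracting 1 gives the index of the last one
    pos = np.searchsorted(sorted_dates, targets.to_numpy(), side="right") - 1
    # Assumption: no target precedes the earliest date (pos would be -1 otherwise)
    if (pos < 0).any():
        raise ValueError("some targets precede the earliest date")
    return pd.Series(sorted_dates[pos], index=targets.index)


result = test_df.assign(
    offset_date=last_available(test_df["date"], test_df["offset_date"])
)
print(result)
```

A target that already appears in date maps to itself, so no separate isin check is needed in this variant.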
初见 2025-01-25 11:16:54

This is the purpose of Pandas' merge_asof function.
We have to be specific about which columns are going where. This will work as a left join and in this case, we want 'offset_date' to represent the "left". Then, for each value in 'offset_date', we look for the greatest value in 'date' that does not exceed that value in 'offset_date'.

The one gotcha in this approach is that both columns need to be sorted. If this is an issue with the real data, let me know and I'll add a section at the bottom that deals with this.

new_offset_date = pd.merge_asof(
    test_df[['offset_date']],  # limit `test_df` to just the column I need for the merge
    test_df[['date']],         # limit `test_df` to just the other column I need
    left_on=['offset_date'],   # name the columns since they have different names
    right_on=['date']          # name the other column as well
)['date']

The new dates are now in a pandas.Series. We can preview the result with the assign method. Note that assign DOES NOT modify your dataframe in place: to actually persist the new dates, use test_df = test_df.assign(offset_date=new_offset_date).

test_df.assign(offset_date=new_offset_date)

        date offset_date
0 2021-01-01  2021-01-01
1 2021-01-03  2021-01-04
2 2021-01-04  2021-01-05
3 2021-01-05  2021-01-05
4 2021-01-07  2021-01-07
5 2021-01-09  2021-01-10
6 2021-01-10  2021-01-11
7 2021-01-11  2021-01-11
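
The sorting caveat mentioned above could be handled roughly as follows when the rows are not already in order. This is a sketch under assumptions, not part of the original answer: it shuffles the sample rows with a fixed, arbitrary order for illustration, sorts each side separately for merge_asof, and uses the original row labels (kept via reset_index) to put the result back in place. It also assumes every offset_date is on or after the earliest date; otherwise merge_asof would leave NaT for that row.

```python
import pandas as pd

test_df = pd.DataFrame({
    "date": pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05',
                            '2021-01-07', '2021-01-09', '2021-01-10', '2021-01-11']),
    "offset_date": pd.to_datetime(['2021-01-02', '2021-01-04', '2021-01-05',
                                   '2021-01-06', '2021-01-08', '2021-01-10',
                                   '2021-01-11', '2021-01-12']),
})

# Shuffle the rows with a fixed order so that neither column is sorted any more
order = [3, 0, 6, 1, 7, 4, 2, 5]
test_df = test_df.iloc[order].reset_index(drop=True)

# merge_asof needs each key sorted; keep the original row labels in an "index" column
left = test_df[["offset_date"]].reset_index().sort_values("offset_date")
right = test_df[["date"]].sort_values("date")

merged = pd.merge_asof(left, right, left_on="offset_date", right_on="date")

# Re-attach the matched dates to the original row labels
new_offset_date = merged.set_index("index")["date"].sort_index()

# assign aligns on the index, so each row gets its own matched date
result = test_df.assign(offset_date=new_offset_date)
print(result)
```

Sorting the two sides independently is safe here because merge_asof only uses the right-hand frame as a lookup table; the row order of the left frame is restored afterwards through the saved labels.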
仅此而已 2025-01-25 11:16:54

The approach below should work for your case:

test_df["offset_date"] = np.where(
    test_df.offset_date.isin(test_df.date),
    test_df.offset_date,
    test_df.date
)