Pandas: replace column values with the last available value

Posted on 2025-01-18 11:16:54

For the sample data frame built by the code below, I want to update the column offset_date so that any date in offset_date that does not appear in column date is replaced with the last available value in column date.

import numpy as np
import pandas as pd

# Sample data, given as ISO date strings
data = {"date": ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05',
                 '2021-01-07', '2021-01-09', '2021-01-10', '2021-01-11'],
        "offset_date": ['2021-01-02', '2021-01-04', '2021-01-05',
                        '2021-01-06', '2021-01-08', '2021-01-10',
                        '2021-01-11', '2021-01-12']}

test_df = pd.DataFrame(data)
# Convert both columns from strings to datetime64
test_df['date'] = pd.to_datetime(test_df['date'])
test_df['offset_date'] = pd.to_datetime(test_df['offset_date'])

To explain further: in the first row of the data frame above, the offset_date 2021-01-02 does not appear in column date, so I want to replace it with the last available value in column date, i.e. 2021-01-01.

I want a vectorized approach, so I tried the following, which led to incorrect results.

test_df['offset_date_upd'] = np.where(test_df['offset_date'] in test_df['date'].values,
                                      test_df['offset_date'], 
                                      test_df[test_df['date'] <= test_df['offset_date']].values.max())

How can I get the desired output below using a vectorized approach?

Desired Output

+------------+-------------+
|    date    | offset_date |
+------------+-------------+
| 2021-01-01 | 2021-01-01  |
| 2021-01-03 | 2021-01-04  |
| 2021-01-04 | 2021-01-05  |
| 2021-01-05 | 2021-01-05  |
| 2021-01-07 | 2021-01-07  |
| 2021-01-09 | 2021-01-10  |
| 2021-01-10 | 2021-01-11  |
| 2021-01-11 | 2021-01-11  |
+------------+-------------+

墨小墨 2025-01-25 11:16:54

IIUC, you can use isin and where:

test_df['offset_date'] = (test_df['offset_date']
                          .where(test_df['offset_date'].isin(test_df['date']),
                                 test_df['date'])
                         )

Output:

        date offset_date
0 2021-01-01  2021-01-01
1 2021-01-03  2021-01-04
2 2021-01-04  2021-01-05
3 2021-01-05  2021-01-05
4 2021-01-07  2021-01-07
5 2021-01-09  2021-01-10
6 2021-01-10  2021-01-11
7 2021-01-11  2021-01-11
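
As a complementary sketch (not the method used in this answer), the "last available date" can also be looked up directly with NumPy's searchsorted on the sorted unique dates. This does not depend on the replacement sitting in the same row. The names `known`, `pos` and `result` are illustrative, and leaving rows with no earlier date as NaT is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({
    "date": pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05',
                            '2021-01-07', '2021-01-09', '2021-01-10', '2021-01-11']),
    "offset_date": pd.to_datetime(['2021-01-02', '2021-01-04', '2021-01-05', '2021-01-06',
                                   '2021-01-08', '2021-01-10', '2021-01-11', '2021-01-12']),
})

# Sorted, de-duplicated dates to search in.
known = test_df["date"].drop_duplicates().sort_values().to_numpy()

# For each offset_date, index of the last known date that is <= offset_date.
# side="right" makes an exact match resolve to the matching date itself.
pos = np.searchsorted(known, test_df["offset_date"].to_numpy(), side="right") - 1

# pos == -1 means no date at or before offset_date exists; keep those rows as NaT.
result = pd.Series(known[np.clip(pos, 0, None)], index=test_df.index).where(pos >= 0)

test_df = test_df.assign(offset_date=result)
```

On the sample data this gives the same offset_date column as the output shown above.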

初见 2025-01-25 11:16:54

This is exactly what pandas' merge_asof function is for.
We have to be specific about which columns go where. It works as a left join, and in this case we want 'offset_date' to be the "left" side. Then, for each value in 'offset_date', it finds the greatest value in 'date' that does not exceed it.

The one gotcha in this approach is that both columns need to be sorted. If this is an issue with the real data, let me know and I'll add a section at the bottom that deals with this.

new_offset_date = pd.merge_asof(
    test_df[['offset_date']],  # limit `test_df` to just the column I need for the merge
    test_df[['date']],         # limit `test_df` to just the other column I need
    left_on=['offset_date'],   # name the columns since they have different names
    right_on=['date']          # name the other column as well
)['date']

The new dates are now in a pandas.Series. We can preview the result with the assign method, which does NOT overwrite your dataframe; to actually persist the new dates, use test_df = test_df.assign(offset_date=new_offset_date).

test_df.assign(offset_date=new_offset_date)

        date offset_date
0 2021-01-01  2021-01-01
1 2021-01-03  2021-01-04
2 2021-01-04  2021-01-05
3 2021-01-05  2021-01-05
4 2021-01-07  2021-01-07
5 2021-01-09  2021-01-10
6 2021-01-10  2021-01-11
7 2021-01-11  2021-01-11
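
If the columns are not already sorted, one possible way to handle it (a sketch, not part of the original answer) is to sort each side separately, run merge_asof, and then put the result back in the original row order. The shuffled sample below and the names `left`, `right` and `merged` are assumptions for illustration. The `"index"` column is the default name reset_index gives an unnamed index.

```python
import pandas as pd

# A few of the question's values, deliberately shuffled (illustrative data).
df = pd.DataFrame({
    "date": pd.to_datetime(['2021-01-05', '2021-01-01', '2021-01-04', '2021-01-03']),
    "offset_date": pd.to_datetime(['2021-01-06', '2021-01-02', '2021-01-05', '2021-01-04']),
})

# Keep the original row labels in an "index" column, then sort each side by its key.
left = df[["offset_date"]].reset_index().sort_values("offset_date")
right = df[["date"]].sort_values("date")

# For each offset_date, take the greatest date that does not exceed it.
merged = pd.merge_asof(left, right, left_on="offset_date", right_on="date")

# Restore the original row order before assigning back.
new_offset_date = merged.set_index("index")["date"].reindex(df.index)
df = df.assign(offset_date=new_offset_date)
```

Carrying the original index through the merge is what lets the result line up with the unsorted frame afterwards.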

仅此而已 2025-01-25 11:16:54

The approach below should work for your case:

test_df["offset_date"] = np.where(
    test_df.offset_date.isin(test_df.date),
    test_df.offset_date,
    test_df.date
)
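
One thing worth checking before reusing this on other data: the fallback here is the date in the same row. In the question's sample that happens to be the last available date, but that is not guaranteed in general. The small sketch below uses hypothetical data (not from the question) where the two differ, and compares the result against a merge_asof lookup.

```python
import numpy as np
import pandas as pd

# Hypothetical data: in row 1 the row's own date is later than its offset_date.
df = pd.DataFrame({
    "date": pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-04']),
    "offset_date": pd.to_datetime(['2021-01-02', '2021-01-02', '2021-01-04']),
})

# Same-row fallback, as in the snippet above.
same_row = pd.Series(
    np.where(df.offset_date.isin(df.date), df.offset_date, df.date),
    index=df.index,
)

# Last date at or before each offset_date (both columns are already sorted here).
last_available = pd.merge_asof(
    df[["offset_date"]], df[["date"]],
    left_on="offset_date", right_on="date",
)["date"]

# Row 1 differs: same_row gives 2021-01-03, last_available gives 2021-01-01.
print(pd.DataFrame({"same_row": same_row, "last_available": last_available}))
```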

Author: 娇纵
