Pandas: replace column values with the last available value
For the sample data frame built by the code below, I want to update column offset_date so that any date in offset_date that does not appear in column date is replaced with the last available value in column date.
import numpy as np
import pandas as pd

data = {"date": ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05',
                 '2021-01-07', '2021-01-09', '2021-01-10', '2021-01-11'],
        "offset_date": ['2021-01-02', '2021-01-04', '2021-01-05',
                        '2021-01-06', '2021-01-08', '2021-01-10',
                        '2021-01-11', '2021-01-12']}
test_df = pd.DataFrame(data)
# Parse both columns from strings into datetime64 values.
test_df['date'] = pd.to_datetime(test_df['date'])
test_df['offset_date'] = pd.to_datetime(test_df['offset_date'])
To explain further: in the first row of the data frame above, the offset_date value 2021-01-02 is not in column date, so I want to replace it with the last available value in column date, i.e. 2021-01-01.
I want a vectorized approach, so I tried the following, which led to incorrect results:
test_df['offset_date_upd'] = np.where(test_df['offset_date'] in test_df['date'].values,
test_df['offset_date'],
test_df[test_df['date'] <= test_df['offset_date']].values.max())
How can I get the below desired output using a vectorized approach?
Desired Output
+------------+-------------+
| date       | offset_date |
+------------+-------------+
| 2021-01-01 | 2021-01-01  |
| 2021-01-03 | 2021-01-04  |
| 2021-01-04 | 2021-01-05  |
| 2021-01-05 | 2021-01-05  |
| 2021-01-07 | 2021-01-07  |
| 2021-01-09 | 2021-01-10  |
| 2021-01-10 | 2021-01-11  |
| 2021-01-11 | 2021-01-11  |
+------------+-------------+
Comments (3)
IIUC, you can use isin, mask, and fillna.
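The snippet that went with this answer did not survive on this page. As a minimal sketch of how isin, mask, and fillna could be combined (not necessarily the answerer's exact code), the version below blanks out offset dates that are missing from date and fills them from the same row's date. The column name offset_date_upd is borrowed from the question. Filling from the same row is an assumption: it matches "last available value" for the sample data because every offset_date is one day after its row's date.

```python
import pandas as pd

data = {
    "date": ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05',
             '2021-01-07', '2021-01-09', '2021-01-10', '2021-01-11'],
    "offset_date": ['2021-01-02', '2021-01-04', '2021-01-05',
                    '2021-01-06', '2021-01-08', '2021-01-10',
                    '2021-01-11', '2021-01-12'],
}
test_df = pd.DataFrame(data)
test_df['date'] = pd.to_datetime(test_df['date'])
test_df['offset_date'] = pd.to_datetime(test_df['offset_date'])

# True where the offset date also appears somewhere in the 'date' column.
known = test_df['offset_date'].isin(test_df['date'])

# Replace unknown offset dates with NaT, then fill each gap with the 'date'
# value from the same row (assumption: that row's date is the last available one).
test_df['offset_date_upd'] = (
    test_df['offset_date'].mask(~known).fillna(test_df['date'])
)

print(test_df)
```

If the real data does not have this one-day relationship between the two columns, the same-row fill may not be the latest earlier date, so check that assumption before relying on it.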
This is the purpose of pandas' merge_asof function.
We have to be specific about which columns go where. It works as a left join, and in this case we want 'offset_date' to represent the "left" side. Then, for each value in 'offset_date', we look for the greatest value in 'date' that does not exceed it.
The one gotcha in this approach is that both columns need to be sorted. If this is an issue with the real data, let me know and I'll add a section at the bottom that deals with this.
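The answer's own code is missing from this page, so the following is a sketch of a merge_asof call that fits the description above, not a copy of the original. The name new_offset_date comes from the answer's text. The helper last_available_date and its temporary _row column are hypothetical additions that show one way to handle the sorting gotcha for unsorted input.

```python
import numpy as np
import pandas as pd

data = {
    "date": ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05',
             '2021-01-07', '2021-01-09', '2021-01-10', '2021-01-11'],
    "offset_date": ['2021-01-02', '2021-01-04', '2021-01-05',
                    '2021-01-06', '2021-01-08', '2021-01-10',
                    '2021-01-11', '2021-01-12'],
}
test_df = pd.DataFrame(data)
test_df['date'] = pd.to_datetime(test_df['date'])
test_df['offset_date'] = pd.to_datetime(test_df['offset_date'])

# Left side: 'offset_date'. The default direction='backward' picks, for each
# offset_date, the greatest 'date' that does not exceed it.
merged = pd.merge_asof(
    test_df[['offset_date']],
    test_df[['date']],
    left_on='offset_date',
    right_on='date',
)
new_offset_date = merged['date']


def last_available_date(df):
    """Hypothetical helper: same lookup, but tolerant of unsorted columns."""
    # Remember the original row positions, then sort both sides as merge_asof requires.
    left = (df[['offset_date']]
            .assign(_row=np.arange(len(df)))
            .sort_values('offset_date'))
    right = df[['date']].sort_values('date')
    out = pd.merge_asof(left, right, left_on='offset_date', right_on='date')
    # Put the matched dates back into the original row order.
    restored = out.sort_values('_row')['date'].to_numpy()
    return pd.Series(restored, index=df.index, name='offset_date')


shuffled = test_df.sample(frac=1, random_state=0)
print(new_offset_date)
print(last_available_date(shuffled).sort_index())
```

An offset date earlier than every value in date has no match and comes back as NaT, which may or may not be what the real data needs.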
Now the new dates are in a pandas.Series. We can see the result with the assign method, which does NOT overwrite your dataframe; you'll need test_df = test_df.assign(offset_date=new_offset_date) to actually persist the new dates in the dataframe.
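To make the difference between previewing and persisting concrete, here is a small sketch. It assumes new_offset_date is produced by a merge_asof call like the one described in this answer; the names preview and unchanged_first are illustrative only.

```python
import pandas as pd

data = {
    "date": ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05',
             '2021-01-07', '2021-01-09', '2021-01-10', '2021-01-11'],
    "offset_date": ['2021-01-02', '2021-01-04', '2021-01-05',
                    '2021-01-06', '2021-01-08', '2021-01-10',
                    '2021-01-11', '2021-01-12'],
}
test_df = pd.DataFrame(data)
test_df['date'] = pd.to_datetime(test_df['date'])
test_df['offset_date'] = pd.to_datetime(test_df['offset_date'])

new_offset_date = pd.merge_asof(
    test_df[['offset_date']], test_df[['date']],
    left_on='offset_date', right_on='date',
)['date']

# assign returns a NEW DataFrame; test_df itself is left untouched.
preview = test_df.assign(offset_date=new_offset_date)
unchanged_first = test_df['offset_date'].iloc[0]  # still the original value

# Rebinding the name is what actually keeps the new dates.
test_df = test_df.assign(offset_date=new_offset_date)
print(test_df)
```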
The approach below should work for your case.
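The code for this answer was not preserved here, and what it contained is unknown. Purely as an illustration of one vectorized approach that fits the question (an assumption, not the answerer's method), the sketch below uses NumPy's searchsorted to find, for each offset date, the position of the last date that does not exceed it. The column name offset_date_upd is taken from the question.

```python
import numpy as np
import pandas as pd

data = {
    "date": ['2021-01-01', '2021-01-03', '2021-01-04', '2021-01-05',
             '2021-01-07', '2021-01-09', '2021-01-10', '2021-01-11'],
    "offset_date": ['2021-01-02', '2021-01-04', '2021-01-05',
                    '2021-01-06', '2021-01-08', '2021-01-10',
                    '2021-01-11', '2021-01-12'],
}
test_df = pd.DataFrame(data)
test_df['date'] = pd.to_datetime(test_df['date'])
test_df['offset_date'] = pd.to_datetime(test_df['offset_date'])

# searchsorted needs a sorted lookup array.
dates = np.sort(test_df['date'].to_numpy())
offsets = test_df['offset_date'].to_numpy()

# side='right' returns the insertion point after any equal value, so
# subtracting 1 gives the index of the greatest date <= each offset date.
idx = np.searchsorted(dates, offsets, side='right') - 1

# idx == -1 means no date is early enough; clip for safe indexing and
# turn those rows into NaT afterwards.
matched = pd.Series(dates[idx.clip(min=0)], index=test_df.index)
test_df['offset_date_upd'] = matched.where(idx >= 0)

print(test_df)
```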