Take min and max with null values - pandas groupby
I have a `pandas.DataFrame` that looks like this:
| id | start | end |
|:--:|:-----------------:|:----------------:|
| a | 1/1/20 12:00 AM | 1/2/20 12:00 AM |
| b | 1/1/20 6:37 PM | 1/2/20 7:11 PM |
| b | 1/4/20 1:17 AM | |
| c | 2/4/20 12:00 AM | 7/13/20 12:00 AM |
| d | 4/19/20 8:45 PM | 4/23/20 12:13 AM |
| d | 11/21/20 12:00 AM | 3/2/21 12:00 AM |
For each `id`, I am trying to determine the `min()` of `start` and the `max()` of `end`. My issue is that in some cases `end` can be blank, and in that case it should count as the max (i.e., the issue is not closed).
Ideally, the result would look like this:
| id | start | end |
|:--:|:---------------:|:----------------:|
| a | 1/1/20 12:00 AM | 1/2/20 12:00 AM |
| b | 1/1/20 6:37 PM | |
| c | 2/4/20 12:00 AM | 7/13/20 12:00 AM |
| d | 4/19/20 8:45 PM | 3/2/21 12:00 AM |
I have looked at this question for inspiration but have not found a solution.
MRE below:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'id': ['a', 'b', 'b', 'c', 'd', 'd'],
    'start': ['1/1/20 12:00 AM', '1/1/20 6:37 PM', '1/4/20 1:17 AM', '2/4/20 12:00 AM', '4/19/20 8:45 PM', '11/21/20 12:00 AM'],
    'end': ['1/2/20 12:00 AM', '1/2/20 7:11 PM', np.nan, '7/13/20 12:00 AM', '4/23/20 12:13 AM', '3/2/21 12:00 AM'],
})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
starts = df.groupby('id')['start'].min().reset_index()
ends = df.groupby('id')['end'].max().reset_index()
_df = pd.merge(starts, ends, on='id')
However, that produces:
| id | start | end |
|:--:|:---------------:|:--------------:|
| a | 1/1/2020 0:00 | 1/2/2020 0:00 |
| b | 1/1/2020 18:37 | 1/2/2020 19:11 |
| c | 2/4/2020 0:00 | 7/13/2020 0:00 |
| d | 4/19/2020 20:45 | 3/2/2021 0:00 |
How can I accomplish my desired result?
Comments (2)
IIUC, use `DataFrame.mask` to set NaN wherever a group has any NaN in a given column.
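The code that accompanied this answer did not survive on this page. What follows is only a minimal sketch of the idea, assuming the question's columns (`id`, `start`, `end`) and rebuilding its sample data with ISO-formatted timestamps. The names `agg` and `open_ended` are illustrative and not taken from the answer.

```python
import pandas as pd

# Sample data from the question, written as ISO timestamps (None = still open).
df = pd.DataFrame({
    "id": ["a", "b", "b", "c", "d", "d"],
    "start": ["2020-01-01 00:00", "2020-01-01 18:37", "2020-01-04 01:17",
              "2020-02-04 00:00", "2020-04-19 20:45", "2020-11-21 00:00"],
    "end": ["2020-01-02 00:00", "2020-01-02 19:11", None,
            "2020-07-13 00:00", "2020-04-23 00:13", "2021-03-02 00:00"],
})
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])

# Plain aggregation: max() skips missing values, so group "b" gets a date here.
agg = df.groupby("id").agg(start=("start", "min"), end=("end", "max"))

# True for every id that has at least one missing end.
open_ended = df["end"].isna().groupby(df["id"]).any()

# Blank out the aggregated end for those ids; mask aligns on the id index.
result = agg.assign(end=agg["end"].mask(open_ended)).reset_index()
print(result)
```

Group `b` keeps its earliest `start` but ends up with a missing `end`, which matches the desired table in the question.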
The same approach works when grouping by multiple columns.
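For grouping by several keys, a sketch along the same lines might look like the following. The second key column `team` and the sample rows are hypothetical, added only to illustrate the technique, and the answer's own code is not reproduced here. Note that this version masks every aggregated column, so a group with a missing `start` would also get a missing `start`.

```python
import pandas as pd

# Hypothetical data with two grouping keys; "team" is not in the question.
df = pd.DataFrame({
    "id": ["a", "b", "b", "b"],
    "team": ["x", "x", "x", "y"],
    "start": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 18:37",
                             "2020-01-04 01:17", "2020-02-01 00:00"]),
    "end": pd.to_datetime(["2020-01-02 00:00", "2020-01-02 19:11",
                           None, "2020-02-03 00:00"]),
})

keys = ["id", "team"]
value_cols = ["start", "end"]

# Ordinary min/max per (id, team) group.
agg = df.groupby(keys).agg({"start": "min", "end": "max"})

# Per group and per column: does any row have a missing value?
has_nan = df[value_cols].isna().groupby([df[k] for k in keys]).any()

# Same index and columns as agg, so mask lines up cell by cell.
result = agg.mask(has_nan).reset_index()
print(result)
```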
Sort by dates and use `iloc` to get the last value.
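The code for this answer is also missing from the page. One way the idea could be sketched, assuming the question's columns and ISO-formatted sample data: sort by `end` with missing values placed last, then take the last row of each group positionally. `iloc[-1]` is used rather than `GroupBy.last()`, since `last()` returns the last non-null entry and would skip the blank. The names `starts` and `ends` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "b", "c", "d", "d"],
    "start": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 18:37",
                             "2020-01-04 01:17", "2020-02-04 00:00",
                             "2020-04-19 20:45", "2020-11-21 00:00"]),
    "end": pd.to_datetime(["2020-01-02 00:00", "2020-01-02 19:11", None,
                           "2020-07-13 00:00", "2020-04-23 00:13",
                           "2021-03-02 00:00"]),
})

# Earliest start per id.
starts = df.groupby("id")["start"].min()

# Sort so missing ends come last, then pick the last row of each group by position.
ends = (
    df.sort_values("end", na_position="last")
      .groupby("id")["end"]
      .apply(lambda s: s.iloc[-1])
)

result = pd.concat([starts, ends], axis=1).reset_index()
print(result)
```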