Taking min and max with null values - pandas groupby

Published 2025-01-10 16:42:08


I have a pandas.DataFrame that looks like this:

| id |       start       |        end       |
|:--:|:-----------------:|:----------------:|
|  a |  1/1/20 12:00 AM  |  1/2/20 12:00 AM |
|  b |   1/1/20 6:37 PM  |  1/2/20 7:11 PM  |
|  b |   1/4/20 1:17 AM  |                  |
|  c |  2/4/20 12:00 AM  | 7/13/20 12:00 AM |
|  d |  4/19/20 8:45 PM  | 4/23/20 12:13 AM |
|  d | 11/21/20 12:00 AM |  3/2/21 12:00 AM |

For each id, I am trying to determine the min() start and the max() end. My issue is that in some cases end can be blank, and when it is, it should count as the max (i.e., the issue is not closed).

Ideally, the result would look like this:

| id |      start      |        end       |
|:--:|:---------------:|:----------------:|
|  a | 1/1/20 12:00 AM |  1/2/20 12:00 AM |
|  b |  1/1/20 6:37 PM |                  |
|  c | 2/4/20 12:00 AM | 7/13/20 12:00 AM |
|  d | 4/19/20 8:45 PM |  3/2/21 12:00 AM |

I have looked at this question for inspiration but have not found a solution.

MRE below:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({
    'id':    {0: 'a', 1: 'b', 2: 'b', 3: 'c', 4: 'd', 5: 'd'},
    'start': {0: '1/1/20 12:00 AM', 1: '1/1/20 6:37 PM', 2: '1/4/20 1:17 AM',
              3: '2/4/20 12:00 AM', 4: '4/19/20 8:45 PM', 5: '11/21/20 12:00 AM'},
    'end':   {0: '1/2/20 12:00 AM', 1: '1/2/20 7:11 PM', 2: np.nan,
              3: '7/13/20 12:00 AM', 4: '4/23/20 12:13 AM', 5: '3/2/21 12:00 AM'},
})

df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])

starts = df.groupby('id')['start'].min().reset_index()
ends = df.groupby('id')['end'].max().reset_index()

_df = pd.merge(starts, ends, on='id')

However, that produces:

| id |      start      |       end      |
|:--:|:---------------:|:--------------:|
|  a |  1/1/2020 0:00  |  1/2/2020 0:00 |
|  b |  1/1/2020 18:37 | 1/2/2020 19:11 |
|  c |  2/4/2020 0:00  | 7/13/2020 0:00 |
|  d | 4/19/2020 20:45 |  3/2/2021 0:00 |

How can I accomplish my desired result?
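For reference, what I believe is happening (my assumption, not verified against the pandas source): pandas reductions skip missing values by default, so the NaT end for id b never survives the max. A minimal demonstration:

```python
import pandas as pd

# Series.max — like groupby().max() — skips missing values by default
# (skipna=True), so the NaT for id b is silently dropped
s = pd.Series(pd.to_datetime(['1/2/20 7:11 PM', None]))
print(s.max())               # 2020-01-02 19:11:00 (NaT skipped)
print(s.max(skipna=False))   # NaT
```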


Comments (2)

将军与妓 2025-01-17 16:42:08


IIUC,DataFrame.mask to set NaN where there are any nan for each group and col

new_df = (
    df.groupby('id')
      .agg({'start': 'min', 'end': 'max'})
      .mask(df[['start', 'end']].isna()
              .groupby(df['id'])
              .max())
      .reset_index()
)

print(new_df)
  id               start        end
0  a 2020-01-01 00:00:00 2020-01-02
1  b 2020-01-01 18:37:00        NaT
2  c 2020-02-04 00:00:00 2020-07-13
3  d 2020-04-19 20:45:00 2021-03-02

Detail:

print(df[['start', 'end']].isna()
                            .groupby(df['id'])
                            .max())

    start    end
id              
a   False  False
b   False   True
c   False  False
d   False  False

In the case of multiple columns to group by:

new_df = (
    df.groupby(['id', 'status'])
      .agg({'start': 'min', 'end': 'max'})
      .mask(df[['start', 'end']].isna()
              .groupby([df['id'], df['status']])
              .max())
      .reset_index()
)
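A single-pass variant of the same idea (a sketch of my own, not part of the original answer; named aggregation assumes pandas >= 0.25): let agg return NaT directly whenever the group contains a missing end.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'b', 'b'],
    'start': pd.to_datetime(['1/1/20 12:00 AM', '1/1/20 6:37 PM', '1/4/20 1:17 AM']),
    'end': pd.to_datetime(['1/2/20 12:00 AM', '1/2/20 7:11 PM', np.nan]),
})

# named aggregation: min of start, plus a max of end that propagates NaT
# instead of skipping it when any row in the group is open
new_df = df.groupby('id').agg(
    start=('start', 'min'),
    end=('end', lambda s: s.max() if s.notna().all() else pd.NaT),
).reset_index()
print(new_df)
```

This avoids building the separate boolean mask, at the cost of a Python-level lambda per group.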
伤感在游骋 2025-01-17 16:42:08


Sort by the dates and use iloc to get the last value:

df.sort_values(["start", "end"]).groupby("id").agg({"start": "first",
                                                    "end": lambda x: x.iloc[-1]})
#                  start        end
# id                               
# a  2020-01-01 00:00:00 2020-01-02
# b  2020-01-01 18:37:00        NaT
# c  2020-02-04 00:00:00 2020-07-13
# d  2020-04-19 20:45:00 2021-03-02
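One detail this relies on (my reading, not stated in the answer): the NaT row must end up last within its group. Series.sort_values puts NaT at the end by default (na_position='last'), and in this data the open row also has the latest start, so iloc[-1] lands on the NaT:

```python
import pandas as pd

# NaT sorts last by default (na_position='last'), so after sorting,
# the open end is the value picked up by iloc[-1]
s = pd.Series(pd.to_datetime(['1/2/20 7:11 PM', None]))
print(s.sort_values().iloc[-1])  # NaT
```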