创建指标等于 1 的日期范围

发布于 2025-01-13 14:14:33 字数 1392 浏览 1 评论 0原文

如何为 ids 创建日期范围的数据框,其中 indicator = 1

#Proxy main high frequency dataframe
main_data = [['site a', '2021-03-05 01:00:00', 1], 
             ['site a', '2021-03-05 01:30:00', 1], 
             ['site a', '2021-03-05 02:00:00', 0],
             ['site a', '2021-03-05 02:30:00', 1],
             ['site a', '2021-03-05 02:30:00', 1],
             ['site b', '2021-04-08 20:00:00', 0],
             ['site b', '2021-04-09 20:00:00', 1],
             ['site b', '2021-04-10 20:00:00', 1],
             ['site b', '2021-04-10 20:30:00', 1]] 
 
# Create the pandas DataFrame
main_df = pd.DataFrame(main_data, columns = ['id', 'timestamp', 'indicator'])
main_df['timestamp'] = pd.to_datetime(main_df['timestamp'], infer_datetime_format=True)

print(main_df)
id  timestamp   indicator
0   site a  2021-03-05 01:00:00 1
1   site a  2021-03-05 01:30:00 1
2   site a  2021-03-05 02:00:00 0
3   site a  2021-03-05 02:30:00 1
4   site a  2021-03-05 02:30:00 1
5   site b  2021-04-08 20:00:00 0
6   site b  2021-04-09 20:00:00 1
7   site b  2021-04-10 20:00:00 1
8   site b  2021-04-10 20:30:00 1

所需的输出数据帧:

print(desired_df)

    id      start               end
0   site a  2021-03-05 01:00:00 2021-03-05 01:30:00
1   site a  2021-03-05 02:30:00 2021-03-05 02:30:00
2   site b  2021-04-09 20:00:00 2021-04-10 20:30:00

How do I create dataframe of date ranges for ids where indicator = 1?

#Proxy main high frequency dataframe
main_data = [['site a', '2021-03-05 01:00:00', 1], 
             ['site a', '2021-03-05 01:30:00', 1], 
             ['site a', '2021-03-05 02:00:00', 0],
             ['site a', '2021-03-05 02:30:00', 1],
             ['site a', '2021-03-05 02:30:00', 1],
             ['site b', '2021-04-08 20:00:00', 0],
             ['site b', '2021-04-09 20:00:00', 1],
             ['site b', '2021-04-10 20:00:00', 1],
             ['site b', '2021-04-10 20:30:00', 1]] 
 
# Create the pandas DataFrame
main_df = pd.DataFrame(main_data, columns = ['id', 'timestamp', 'indicator'])
main_df['timestamp'] = pd.to_datetime(main_df['timestamp'], infer_datetime_format=True)

print(main_df)
id  timestamp   indicator
0   site a  2021-03-05 01:00:00 1
1   site a  2021-03-05 01:30:00 1
2   site a  2021-03-05 02:00:00 0
3   site a  2021-03-05 02:30:00 1
4   site a  2021-03-05 02:30:00 1
5   site b  2021-04-08 20:00:00 0
6   site b  2021-04-09 20:00:00 1
7   site b  2021-04-10 20:00:00 1
8   site b  2021-04-10 20:30:00 1

Desired Output dataframe:

print(desired_df)

    id      start               end
0   site a  2021-03-05 01:00:00 2021-03-05 01:30:00
1   site a  2021-03-05 02:30:00 2021-03-05 02:30:00
2   site b  2021-04-09 20:00:00 2021-04-10 20:30:00

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

无畏 2025-01-20 14:14:33

IIUC:

  1. groupby 指标列序列并记录“start”和“end”的minmax 值。
  2. 删除不需要的列和重复项。
main_df["start"] = main_df.groupby(main_df["indicator"].ne(main_df["indicator"].shift()).cumsum())["timestamp"].transform("min")
main_df["end"] = main_df.groupby(main_df["indicator"].ne(main_df["indicator"].shift()).cumsum())["timestamp"].transform("max")

output = main_df[main_df["indicator"].eq(1)].drop_duplicates(["start", "end"])

>>> output
       id               start                 end
0  site a 2021-03-05 01:00:00 2021-03-05 01:30:00
3  site a 2021-03-05 02:30:00 2021-03-05 02:30:00
6  site b 2021-04-09 20:00:00 2021-04-10 20:30:00

IIUC:

  1. groupby sequences of the indicator column and record the min and max values for "start" and "end".
  2. drop unneeded columns and duplicates.
main_df["start"] = main_df.groupby(main_df["indicator"].ne(main_df["indicator"].shift()).cumsum())["timestamp"].transform("min")
main_df["end"] = main_df.groupby(main_df["indicator"].ne(main_df["indicator"].shift()).cumsum())["timestamp"].transform("max")

output = main_df[main_df["indicator"].eq(1)].drop_duplicates(["start", "end"])

>>> output
       id               start                 end
0  site a 2021-03-05 01:00:00 2021-03-05 01:30:00
3  site a 2021-03-05 02:30:00 2021-03-05 02:30:00
6  site b 2021-04-09 20:00:00 2021-04-10 20:30:00
乜一 2025-01-20 14:14:33

您可以将 grouby 与命名聚合一起使用,如下所示,首先创建指标 1、ind_grp 组,其中 eq 为零,cumsum

ind_grp = main_df['indicator'].eq(0).cumsum()

main_df.groupby(['id', ind_grp], as_index=False)\
       .agg(start=('timestamp', 'min'),
              end=('timestamp','max'))

输出:

       id               start                 end
0  site a 2021-03-05 01:00:00 2021-03-05 01:30:00
1  site a 2021-03-05 02:00:00 2021-03-05 02:30:00
2  site b 2021-04-08 20:00:00 2021-04-10 20:30:00

You can use grouby with named aggregrations like this, first create groups of indicators 1, ind_grp, with eq to zero and cumsum:

ind_grp = main_df['indicator'].eq(0).cumsum()

main_df.groupby(['id', ind_grp], as_index=False)\
       .agg(start=('timestamp', 'min'),
              end=('timestamp','max'))

Output:

       id               start                 end
0  site a 2021-03-05 01:00:00 2021-03-05 01:30:00
1  site a 2021-03-05 02:00:00 2021-03-05 02:30:00
2  site b 2021-04-08 20:00:00 2021-04-10 20:30:00
梦年海沫深 2025-01-20 14:14:33

这是一个解决方案:

group = main_df[main_df['indicator'] == 1].groupby(main_df['indicator'].ne(main_df['indicator'].shift(1)).cumsum()[main_df['indicator'] == 1])
pd.DataFrame({'id': group['id'].first().tolist(), 'start': group['timestamp'].first().tolist(), 'end': group['timestamp'].last().tolist()})

输出:

>>> desired_df
       id               start                 end
0  site a 2021-03-05 01:00:00 2021-03-05 01:30:00
1  site a 2021-03-05 02:30:00 2021-03-05 02:30:00
2  site b 2021-04-09 20:00:00 2021-04-10 20:30:00

Here's a solution:

group = main_df[main_df['indicator'] == 1].groupby(main_df['indicator'].ne(main_df['indicator'].shift(1)).cumsum()[main_df['indicator'] == 1])
pd.DataFrame({'id': group['id'].first().tolist(), 'start': group['timestamp'].first().tolist(), 'end': group['timestamp'].last().tolist()})

Output:

>>> desired_df
       id               start                 end
0  site a 2021-03-05 01:00:00 2021-03-05 01:30:00
1  site a 2021-03-05 02:30:00 2021-03-05 02:30:00
2  site b 2021-04-09 20:00:00 2021-04-10 20:30:00
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文