构造不重叠的日期时间记录(开始、结束日期时间)数据框
我需要创建一个数据框,删除多个 id
的重叠 start
和 end
日期时间。我将使用 start
和 end
日期时间来聚合高频 pandas 数据框中的值,因此我需要删除 mst_df
中那些重叠的日期时间。
import pandas as pd
#Proxy reference dataframe
master = [['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'], #slightly overlapping
['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
['site a', '2021-07-08 18:00:00', '2021-07-09 11:40:00'], #overlapping
['site a', '2021-07-09 00:00:00', '2021-07-09 05:40:00'], #overlapping
['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
['site b', '2021-07-08 06:00:00', '2021-07-08 10:24:00'], #overlapping
['site b', '2021-07-08 17:32:00', '2021-07-09 11:12:00'],
['site b', '2021-07-08 18:00:00', '2021-07-09 11:12:00'], #overlapping
['site b', '2021-07-09 00:00:00', '2021-07-09 13:00:00']] #slightly overlapping
mst_df = pd.DataFrame(master, columns = ['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'], infer_datetime_format=True)
mst_df['end'] = pd.to_datetime(mst_df['end'], infer_datetime_format=True)
所需的数据框:
id start end
site a 2021-07-08 00:00:00 2021-07-08 12:00:00
site a 2021-07-08 17:36:00 2021-07-09 11:40:00
site b 2021-07-08 00:00:00 2021-07-08 10:24:00
site b 2021-07-08 17:32:00 2021-07-09 13:00:00
I need to create a dataframe removing over-lapping start
and end
datetimes for multiple ids
. I will use the start
and end
datetimes to aggregate values in a high frequency pandas dataframe, so I need to remove those overlapping datetimes in mst_df
.
import pandas as pd
#Proxy reference dataframe
master = [['site a', '2021-07-08 00:00:00', '2021-07-08 10:56:00'],
['site a', '2021-07-08 06:00:00', '2021-07-08 12:00:00'], #slightly overlapping
['site a', '2021-07-08 17:36:00', '2021-07-09 11:40:00'],
['site a', '2021-07-08 18:00:00', '2021-07-09 11:40:00'], #overlapping
['site a', '2021-07-09 00:00:00', '2021-07-09 05:40:00'], #overlapping
['site b', '2021-07-08 00:00:00', '2021-07-08 10:24:00'],
['site b', '2021-07-08 06:00:00', '2021-07-08 10:24:00'], #overlapping
['site b', '2021-07-08 17:32:00', '2021-07-09 11:12:00'],
['site b', '2021-07-08 18:00:00', '2021-07-09 11:12:00'], #overlapping
['site b', '2021-07-09 00:00:00', '2021-07-09 13:00:00']] #slightly overlapping
mst_df = pd.DataFrame(master, columns = ['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'], infer_datetime_format=True)
mst_df['end'] = pd.to_datetime(mst_df['end'], infer_datetime_format=True)
Desired DataFrame:
id start end
site a 2021-07-08 00:00:00 2021-07-08 12:00:00
site a 2021-07-08 17:36:00 2021-07-09 11:40:00
site b 2021-07-08 00:00:00 2021-07-08 10:24:00
site b 2021-07-08 17:32:00 2021-07-09 13:00:00
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我不知道
pandas
是否有特殊的功能。它有 Interval.overlaping() 来检查是否两个范围重叠(甚至可以使用datetime
),但我没有看到合并这两个范围的函数,因此它仍然需要自己的代码进行合并。幸运的是这很容易。行按
start
排序,因此当previous_end
previous_end
循环中使用它。时,行不会重叠。 next_start
我在 for但首先我按
site
分组,分别处理每个站点。接下来,我获取第一行(如
previous
),并与其他行(如next)
一起运行循环,并检查previous_end
previous_end
next_start
。如果它是
True
,那么我可以将previous
放在结果列表中,并将next
作为previous
与其余部分一起使用行数。如果它是
False
,那么我会从两行创建新范围并使用它来处理其余行。最后,我将
previous
添加到列表中。处理完所有组后,我将所有组都转换为 DataFrame。
结果:
I don't know if
pandas
has special function for this. It has Interval.overlaping() to check if two ranges overlaps (and it works even withdatetime
) but I don't see function to merge these two ranges so it still needs own code for merging. Fortunately it is easy.Rows are sorted by
start
so rows are not overlaping whenprevious_end < next_start
and I use it infor
-loop.But first I group by
site
to work with every site separatelly.Next I get first row (as
previous
) and run loop with other rows (asnext)
and checkprevious_end < next_start
.If it is
True
then I can putprevious
on list of results and getnext
asprevious
to work with rest of rows.If it is
False
then I create new range from both rows and use it to work with rest of rows.Finally I add
previous
to list.After processing all groups I convert all to DataFrame.
Result: