pandas: fill in the missing time intervals given in a DataFrame

Posted on 2025-01-14 10:41:42

I have a DataFrame looking like:

gap_id	species	time_start	time_stop
1	wheat	2021-11-22 00:01:00	2021-11-22 00:03:00
2	fescue	2021-12-18 05:52:00	2021-12-18 05:53:00

I would like to expand the DataFrame such that I get as many rows as the number of minutes between time_start and time_stop for each gap_id:

gap_id	species	time
1	wheat	2021-11-22 00:01:00
1	wheat	2021-11-22 00:02:00
1	wheat	2021-11-22 00:03:00
2	fescue	2021-12-18 05:52:00
2	fescue	2021-12-18 05:53:00

I've tried pd.date_range, but I don't know how to combine it with a groupby on gap_id.

Thanks in advance

花海 2025-01-21 10:41:42

If the DataFrame is small and performance is not important, generate a date_range for each row and then use DataFrame.explode:

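# Build one DatetimeIndex per row, one timestamp per minute (both endpoints included);
# 'min' is used because the 'T' alias is deprecated in recent pandas
df['time'] = df.apply(lambda x: pd.date_range(x['time_start'], x['time_stop'], freq='min'), axis=1)
df = df.drop(['time_start','time_stop'], axis=1).explode('time')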

print (df)
   gap_id species                time
0       1   wheat 2021-11-22 00:01:00
0       1   wheat 2021-11-22 00:02:00
0       1   wheat 2021-11-22 00:03:00
1       2  fescue 2021-12-18 05:52:00
1       2  fescue 2021-12-18 05:53:00

For large DataFrames, first repeat the index by the difference between the stop and start columns in minutes, then add a counter from GroupBy.cumcount, converted to timedeltas with to_timedelta:

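# Parse the interval bounds as datetimes
df['time_start'] = pd.to_datetime(df['time_start'])
df['time_stop'] = pd.to_datetime(df['time_stop'])

# Rows needed per interval: whole minutes between start and stop, +1 to keep the stop time.
# Repeat counts must be integers, hence astype(int).
n = (df['time_stop'].sub(df['time_start']).dt.total_seconds() // 60 + 1).astype(int)
df = (df.loc[df.index.repeat(n)]
        .drop('time_stop', axis=1)
        .rename(columns={'time_start':'time'}))

# Counter 0, 1, 2, ... within each original row, converted to minute offsets
td = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='min')

df['time'] += td
df = df.reset_index(drop=True)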
print (df)
   gap_id species                time
0       1   wheat 2021-11-22 00:01:00
1       1   wheat 2021-11-22 00:02:00
2       1   wheat 2021-11-22 00:03:00
3       2  fescue 2021-12-18 05:52:00
4       2  fescue 2021-12-18 05:53:00
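
For convenience, the repeat-and-offset approach can be wrapped into a self-contained, reproducible sketch. The helper name expand_minutes is hypothetical (it is not a pandas function), the sample frame is rebuilt from the table in the question, and the code assumes time_stop is never earlier than time_start and that neither column contains missing values:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "gap_id": [1, 2],
    "species": ["wheat", "fescue"],
    "time_start": ["2021-11-22 00:01:00", "2021-12-18 05:52:00"],
    "time_stop": ["2021-11-22 00:03:00", "2021-12-18 05:53:00"],
})

def expand_minutes(frame):
    """Hypothetical helper: one row per minute from time_start to time_stop, inclusive."""
    out = frame.copy()
    out["time_start"] = pd.to_datetime(out["time_start"])
    out["time_stop"] = pd.to_datetime(out["time_stop"])
    # Whole minutes in each interval (integer result); +1 keeps the stop time.
    n_rows = (out["time_stop"] - out["time_start"]) // pd.Timedelta(minutes=1) + 1
    out = (out.loc[out.index.repeat(n_rows)]
              .drop(columns="time_stop")
              .rename(columns={"time_start": "time"}))
    # Position of each repeated row within its original row: 0, 1, 2, ...
    offset = pd.to_timedelta(out.groupby(level=0).cumcount(), unit="min")
    out["time"] = out["time"] + offset
    return out.reset_index(drop=True)

result = expand_minutes(df)
print(result)
```

No groupby on gap_id is needed here because each gap_id occupies a single row; grouping on the repeated index level is enough to number the minutes. If a stop time could precede its start time, the repeat count would be negative and the call would raise, so such rows would need to be filtered out first.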
